RepoGEO

REPOGEO REPORT · LITE

tjunlp-lab/Awesome-LLMs-Evaluation-Papers

Default branch main · commit a4895bc1 · scanned 6/11/2026, 9:18:30 AM

GitHub: 803 stars · 62 forks

AI VISIBILITY SCORE
17 / 100
Critical
Category recall
0 / 2
Not recommended in any query
Rule findings
1 pass · 0 warn · 1 fail
Objective metadata checks
AI knows your name
1 / 3
Direct prompts that named your repo
HOW TO READ THIS REPORT

Action plan is what to do next — copy-pasteable changes prioritized by impact. Category visibility is the real GEO test: when a user asks an AI a brand-free question that should surface tjunlp-lab/Awesome-LLMs-Evaluation-Papers, does the AI actually recommend you — or your competitors? Objective checks verify the metadata signals AI engines weight first. Self-mention check detects whether AI even knows you exist by name.

Action plan — copy-paste fixes

2 prioritized changes generated by gemini-2.5-flash. Mark items done after you ship the fix.

OVERALL DIRECTION
  • high · readme · #1
    Add a clear introductory sentence to the README

    Why:

    CURRENT
    # Awesome LLMs Evaluation Papers :bookmark_tabs:
    COPY-PASTE FIX
    # Awesome LLMs Evaluation Papers :bookmark_tabs:
    
    This repository provides a curated and continuously updated list of research papers focused on the evaluation of Large Language Models (LLMs), organized according to our comprehensive survey.
  • high · license · #2
    Add a LICENSE file to the repository

    Why:

    COPY-PASTE FIX
    Add a LICENSE file to the repository root containing the full text of the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Category GEO backends resolved for this scan: google/gemini-2.5-flash, deepseek/deepseek-v4-flash

Category visibility — the real GEO test

Brand-free queries asked to google/gemini-2.5-flash. Did AI recommend you, or someone else?

Recall
0 / 2
0% of queries surface tjunlp-lab/Awesome-LLMs-Evaluation-Papers
Avg rank
Lower is better. #1 = top recommendation.
Share of voice
0%
Of all named tools, what % are you?
Top rival
GPT-4
Recommended in 2 of 2 queries
COMPETITOR LEADERBOARD
  1. GPT-4 · recommended 2×
  2. Scale AI · recommended 1×
  3. Appen · recommended 1×
  4. Streamlit · recommended 1×
  5. Gradio · recommended 1×
  • CATEGORY QUERY
    What are the current best practices for evaluating large language model performance?
    you: not recommended
    AI recommended (in order):
    1. Scale AI
    2. Appen
    3. Streamlit
    4. Gradio
    5. GPT-4
    6. Claude 3 Opus
    7. LangChain
    8. LlamaIndex
    9. Arize AI
    10. Weights & Biases
    11. EleutherAI's LM Evaluation Harness (lm-eval)
    12. Open LLM Leaderboard (Hugging Face)
    13. Ragas
    14. ROUGE
    15. NLTK
    16. Hugging Face Datasets
    17. BLEU
    18. BERTScore

    AI recommended 18 alternatives but never named tjunlp-lab/Awesome-LLMs-Evaluation-Papers. This is the gap to close.
  • CATEGORY QUERY
    Need a comprehensive overview of research papers on assessing large language models.
    you: not recommended
    AI recommended (in order):
    1. HELM Benchmark
    2. SuperGLUE
    3. GPT-4
    4. MATH Dataset
    5. GSM8K
    6. BigBench-Hard
    7. TruthfulQA
    8. Gopher
    9. StereoSet
    10. Chinchilla
    11. arXiv
    12. ACL
    13. EMNLP
    14. NAACL

    AI recommended 14 alternatives but never named tjunlp-lab/Awesome-LLMs-Evaluation-Papers. This is the gap to close.

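
The recall, average rank, share of voice, and leaderboard figures in this section can be reproduced from the two ranked lists above. The sketch below is one plausible way to compute them, not RepoGEO's actual scoring code: the function names, the name-matching rule (case-insensitive match on the repo name), and the share-of-voice definition (target mentions divided by all named recommendations) are assumptions.

```python
from collections import Counter

TARGET = "tjunlp-lab/Awesome-LLMs-Evaluation-Papers"

# Ranked recommendations per brand-free query, copied from this report.
ANSWERS = [
    ["Scale AI", "Appen", "Streamlit", "Gradio", "GPT-4", "Claude 3 Opus",
     "LangChain", "LlamaIndex", "Arize AI", "Weights & Biases",
     "EleutherAI's LM Evaluation Harness (lm-eval)",
     "Open LLM Leaderboard (Hugging Face)", "Ragas", "ROUGE", "NLTK",
     "Hugging Face Datasets", "BLEU", "BERTScore"],
    ["HELM Benchmark", "SuperGLUE", "GPT-4", "MATH Dataset", "GSM8K",
     "BigBench-Hard", "TruthfulQA", "Gopher", "StereoSet", "Chinchilla",
     "arXiv", "ACL", "EMNLP", "NAACL"],
]


def names_target(name, target=TARGET):
    """Assumed rule: a recommendation counts if it contains the repo name."""
    repo = target.split("/")[-1].lower()
    return repo in name.lower()


def category_metrics(answers, target=TARGET, top_n=5):
    ranks = []            # 1-based rank of the target in each query naming it
    mentions = Counter()  # number of queries recommending each rival
    total_named = 0       # every recommendation across all queries
    for ranked in answers:
        total_named += len(ranked)
        hit = next((i for i, n in enumerate(ranked, 1)
                    if names_target(n, target)), None)
        if hit is not None:
            ranks.append(hit)
        # Count each rival at most once per query, keeping list order.
        rivals = dict.fromkeys(n for n in ranked if not names_target(n, target))
        mentions.update(list(rivals))
    return {
        "recall": len(ranks) / len(answers) if answers else 0.0,
        "avg_rank": sum(ranks) / len(ranks) if ranks else None,
        "share_of_voice": len(ranks) / total_named if total_named else 0.0,
        "leaderboard": mentions.most_common(top_n),
    }


metrics = category_metrics(ANSWERS)
print(metrics)
```

With the lists in this report the target is never named, so recall and share of voice are both zero and there is no average rank to report, which is consistent with the empty "Avg rank" card. GPT-4 tops the leaderboard because it is the only name that appears in both answers.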

Objective checks

Rule-based audits of metadata signals AI engines weight most.

  • Metadata completeness
    fail

    Suggestion:

  • README presence
    pass
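
A rule-based metadata check of this kind can be sketched as a lookup over repository fields. This is a minimal illustration under stated assumptions: the report does not say which fields RepoGEO inspects, so the `REQUIRED_FIELDS` tuple and the `metadata_findings` helper are hypothetical, and the sample dictionary is invented for the example rather than taken from the scanned repository. The field names follow the repository object returned by the GitHub REST API.

```python
# Hypothetical rule set; the fields RepoGEO actually checks are not stated.
REQUIRED_FIELDS = ("description", "topics", "license")


def metadata_findings(repo):
    """Return (status, missing) for a dict shaped like a GitHub repo object."""
    # A field counts as missing when it is absent, None, or empty.
    missing = [field for field in REQUIRED_FIELDS if not repo.get(field)]
    status = "pass" if not missing else "fail"
    return status, missing


# Invented sample data, not the real metadata of the scanned repository.
sample_repo = {
    "description": "A curated list of papers on LLM evaluation",
    "topics": ["llm", "evaluation"],
    "license": None,  # no LICENSE file detected
}

status, missing = metadata_findings(sample_repo)
print(status, missing)  # fail ['license']
```

Under this assumed rule set, adding a LICENSE file as the action plan suggests would be enough to turn the sample from fail to pass.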

Self-mention check

Does AI even know your repo exists when asked about it directly?

  • Compared to common alternatives in this category, what is the core differentiator of tjunlp-lab/Awesome-LLMs-Evaluation-Papers?
    pass
    AI did not name tjunlp-lab/Awesome-LLMs-Evaluation-Papers — likely talking about a different project

    AI answers can be confidently wrong. Read for accuracy: does it match your actual tech stack, audience, and differentiator?

  • If a team adopts tjunlp-lab/Awesome-LLMs-Evaluation-Papers in production, what risks or prerequisites should they evaluate first?
    pass
    AI named tjunlp-lab/Awesome-LLMs-Evaluation-Papers explicitly

    AI answers can be confidently wrong. Read for accuracy: does it match your actual tech stack, audience, and differentiator?

  • In one sentence, what problem does the repo tjunlp-lab/Awesome-LLMs-Evaluation-Papers solve, and who is the primary audience?
    pass
    AI did not name tjunlp-lab/Awesome-LLMs-Evaluation-Papers — likely talking about a different project

    AI answers can be confidently wrong. Read for accuracy: does it match your actual tech stack, audience, and differentiator?

Embed your GEO score

Drop this badge into the README of tjunlp-lab/Awesome-LLMs-Evaluation-Papers. It auto-updates whenever the report is rescanned and links back to the latest report — easy public proof that you care about AI discoverability.

MARKDOWN (README)
[![RepoGEO](https://repogeo.com/badge/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.svg)](https://repogeo.com/en/r/tjunlp-lab/Awesome-LLMs-Evaluation-Papers)
HTML
<a href="https://repogeo.com/en/r/tjunlp-lab/Awesome-LLMs-Evaluation-Papers"><img src="https://repogeo.com/badge/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.svg" alt="RepoGEO" /></a>
Pro

Subscribe to Pro for deep diagnoses

tjunlp-lab/Awesome-LLMs-Evaluation-Papers — Lite scans stay free; this card itemizes Pro deep limits vs Lite.

  • Deep reports: 10 / month
  • Brand-free category queries: 5 vs 2 in Lite
  • Prioritized action items: 8 vs 3 in Lite