RepoGEO

REPOGEO REPORT · LITE

adbar/trafilatura

Default branch master · commit ee1865b2 · scanned 5/29/2026, 5:32:08 AM

GitHub: 6,021 stars · 379 forks

AI VISIBILITY SCORE
59 / 100 · Needs work
Category recall: 1 / 2 (avg rank #7.0 when recommended)
Rule findings: 2 pass · 0 warn · 0 fail (objective metadata checks)
AI knows your name: 3 / 3 (direct prompts that named your repo)

HOW TO READ THIS REPORT

  • Action plan: what to do next. Copy-pasteable changes prioritized by impact.
  • Category visibility: the real GEO test. When a user asks an AI a brand-free question that should surface adbar/trafilatura, does the AI actually recommend you, or your competitors?
  • Objective checks: verify the metadata signals AI engines weight first.
  • Self-mention check: detects whether AI even knows you exist by name.

Action plan — copy-paste fixes

3 prioritized changes generated by gemini-2.5-flash. Mark items done after you ship the fix.

OVERALL DIRECTION
  • high · readme · #1
    Emphasize corpus building and large-scale data collection in the README introduction

    Why:

    CURRENT
    Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments.
    COPY-PASTE FIX
    Trafilatura is a cutting-edge **Python package and command-line tool** designed to **gather text on the Web, build large text corpora, and simplify the process of turning raw HTML into structured, meaningful data**. It includes all necessary discovery and text processing components to perform **web crawling, downloads, scraping, and extraction** of main texts, metadata and comments.
  • medium · about · #2
    Refine the repository description to highlight corpus building

    Why:

    CURRENT
    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
    COPY-PASTE FIX
    Python & Command-line tool for large-scale web data collection and text corpus building: Crawling, scraping, extraction of main content and metadata, output as CSV, JSON, HTML, MD, TXT, XML.
  • low · topics · #3
    Add more specific topics related to large-scale data collection and corpus creation

    Why:

    CURRENT
    article-extractor, corpus-builder, corpus-tools, crawler, html-to-markdown, html2text, llm, news-aggregator, news-crawler, nlp, rag, readability, rss-feed, scraping, tei, text-cleaning, text-extraction, text-mining, text-preprocessing, web-scraping
    COPY-PASTE FIX
    article-extractor, corpus-builder, corpus-tools, crawler, html-to-markdown, html2text, llm, news-aggregator, news-crawler, nlp, rag, readability, rss-feed, scraping, tei, text-cleaning, text-extraction, text-mining, text-preprocessing, web-scraping, data-collection, web-data-extraction, corpus-creation
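
Before pasting the topics fix, it may help to diff the two lists and check the resulting count. The sketch below does that in Python. The topic strings are copied from the CURRENT and COPY-PASTE FIX entries above; `MAX_TOPICS = 20` is an assumption based on GitHub's documented per-repository topic cap and should be confirmed against the current GitHub documentation.

```python
# Topic lists copied verbatim from the report's CURRENT and COPY-PASTE FIX entries.
CURRENT_TOPICS = (
    "article-extractor, corpus-builder, corpus-tools, crawler, html-to-markdown, "
    "html2text, llm, news-aggregator, news-crawler, nlp, rag, readability, rss-feed, "
    "scraping, tei, text-cleaning, text-extraction, text-mining, text-preprocessing, "
    "web-scraping"
)
PROPOSED_TOPICS = CURRENT_TOPICS + ", data-collection, web-data-extraction, corpus-creation"

# Assumption: GitHub caps a repository at 20 topics. Verify before relying on it.
MAX_TOPICS = 20


def parse_topics(raw: str) -> list[str]:
    """Split a comma-separated topic string into a clean list."""
    return [t.strip() for t in raw.split(",") if t.strip()]


current = parse_topics(CURRENT_TOPICS)
proposed = parse_topics(PROPOSED_TOPICS)

# Topics that the fix would add, in the order they appear in the proposal.
added = [t for t in proposed if t not in current]
# How many topics would have to be dropped to stay within the assumed cap.
overflow = max(0, len(proposed) - MAX_TOPICS)

print("added:", added)
print("current:", len(current), "proposed:", len(proposed), "over cap by:", overflow)
```

If the assumed cap holds, the proposed list would not fit as written, so adding the three new topics would mean retiring three existing ones.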

Category GEO backends resolved for this scan: google/gemini-2.5-flash, deepseek/deepseek-v4-flash

Category visibility — the real GEO test

Brand-free queries asked to google/gemini-2.5-flash. Did AI recommend you, or someone else?

Recall: 1 / 2 (50% of queries surface adbar/trafilatura)
Avg rank: #7.0 (lower is better; #1 = top recommendation)
Share of voice: 6% (of all named tools, what % are you?)
Top rival: Beautiful Soup 4 (recommended in 1 of 2 queries)
COMPETITOR LEADERBOARD
  1. Beautiful Soup 4 · recommended 1×
  2. Requests · recommended 1×
  3. httpx · recommended 1×
  4. Readability.js · recommended 1×
  5. python-readability · recommended 1×
  • CATEGORY QUERY
    How to programmatically extract clean main content and metadata from arbitrary web pages for analysis?
    you: #7
    AI recommended (in order):
    1. Beautiful Soup 4
    2. Requests
    3. httpx
    4. Readability.js
    5. python-readability
    6. readability-lxml
    7. Trafilatura ← you
    8. Newspaper3k
    9. Scrapy
    10. Goose3
    11. Playwright
    12. Selenium
  • CATEGORY QUERY
    What Python library can help me build a text corpus by scraping web articles?
    you: not recommended
    AI recommended (in order):
    1. Beautiful Soup 4 (crummy/BeautifulSoup)
    2. Scrapy (scrapy/scrapy)
    3. requests (psf/requests)
    4. lxml (lxml/lxml)
    5. Selenium (SeleniumHQ/selenium)
    6. newspaper3k (codelucas/newspaper)

    AI recommended 6 alternatives but never named adbar/trafilatura. This is the gap to close.
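
The recall, average rank, and share-of-voice figures reported for this scan can be reproduced from the two ranked lists above. The sketch below is one plausible reading, not RepoGEO's documented method: it assumes share of voice is the target's mentions divided by all tool mentions across every answer, which matches the reported 6%. The function name `visibility_metrics` is hypothetical.

```python
TARGET = "Trafilatura"

# Ranked recommendations per category query, copied from the report.
answers = [
    ["Beautiful Soup 4", "Requests", "httpx", "Readability.js", "python-readability",
     "readability-lxml", "Trafilatura", "Newspaper3k", "Scrapy", "Goose3",
     "Playwright", "Selenium"],
    ["Beautiful Soup 4", "Scrapy", "requests", "lxml", "Selenium", "newspaper3k"],
]


def visibility_metrics(answers: list[list[str]], target: str) -> dict:
    """Compute recall, average rank, and share of voice for one target tool."""
    # 1-based rank of the target in each answer that mentions it.
    ranks = [a.index(target) + 1 for a in answers if target in a]
    total_named = sum(len(a) for a in answers)
    return {
        "recall": len(ranks) / len(answers),
        "avg_rank": sum(ranks) / len(ranks) if ranks else None,
        # Assumed definition: target mentions over all tool mentions.
        "share_of_voice": len(ranks) / total_named if total_named else 0.0,
    }


metrics = visibility_metrics(answers, TARGET)
print(f"recall: {metrics['recall']:.0%}")
print(f"avg rank: {metrics['avg_rank']}")
print(f"share of voice: {metrics['share_of_voice']:.0%}")
```

With only two queries, a single additional mention would move recall from 50% to 100%, so these figures are best read as a rough signal rather than a stable measurement.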

Objective checks

Rule-based audits of metadata signals AI engines weight most.

  • Metadata completeness
    pass

  • README presence
    pass

Self-mention check

Does AI even know your repo exists when asked about it directly?

  • Compared to common alternatives in this category, what is the core differentiator of adbar/trafilatura?
    pass
    AI named adbar/trafilatura explicitly

    AI answers can be confidently wrong. Read for accuracy: does it match your actual tech stack, audience, and differentiator?

  • If a team adopts adbar/trafilatura in production, what risks or prerequisites should they evaluate first?
    pass
    AI named adbar/trafilatura explicitly

  • In one sentence, what problem does the repo adbar/trafilatura solve, and who is the primary audience?
    pass
    AI named adbar/trafilatura explicitly

Embed your GEO score

Drop this badge into the README of adbar/trafilatura. It auto-updates whenever the report is rescanned and links back to the latest report — easy public proof that you care about AI discoverability.

MARKDOWN (README)
[![RepoGEO](https://repogeo.com/badge/adbar/trafilatura.svg)](https://repogeo.com/en/r/adbar/trafilatura)
HTML
<a href="https://repogeo.com/en/r/adbar/trafilatura"><img src="https://repogeo.com/badge/adbar/trafilatura.svg" alt="RepoGEO" /></a>
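
For maintainers who manage several repositories, the two snippets above could be generated rather than copied by hand. The sketch below is a minimal Python helper. It assumes the badge and report URL patterns shown above apply unchanged to any `owner/repo` pair, which this report does not confirm; `badge_snippets` is a hypothetical name.

```python
def badge_snippets(owner: str, repo: str) -> dict[str, str]:
    """Build the Markdown and HTML badge snippets for a repository.

    Assumption: the URL patterns match those shown for adbar/trafilatura.
    """
    base = "https://repogeo.com"
    badge_url = f"{base}/badge/{owner}/{repo}.svg"
    report_url = f"{base}/en/r/{owner}/{repo}"
    return {
        "markdown": f"[![RepoGEO]({badge_url})]({report_url})",
        "html": f'<a href="{report_url}"><img src="{badge_url}" alt="RepoGEO" /></a>',
    }


snippets = badge_snippets("adbar", "trafilatura")
print(snippets["markdown"])
print(snippets["html"])
```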