methodology · the substance, not the score

How Docs Lens reads your docs.

Three reader profiles, each grounded in named real products. Deterministic checks only. No LLMs in the scoring path. Every claim ties to a real HTTP response, a release note, or a tool's output a skeptic can re-run.

The three reader profiles

Pick a docs URL and we run all three in parallel.

A bare GET with no JavaScript execution. The HTML is run through Turndown to produce markdown, and <style> / <script> tags are stripped before extraction.

Real consumers

Full Chromium, full JavaScript and CSS execution. Reads the post-render DOM. Local dev runs Playwright; hosted deployments proxy through Jina Reader (Puppeteer + Chrome under the hood).

Real consumers
  • ChatGPT Atlas

    Full Chromium via OpenAI's OWL layer. Reads from the rendered DOM.

    OpenAI: Building ChatGPT Atlas
  • Perplexity Comet

    Chromium-based browser, full JS rendering.

    Comet engine
  • Cline + Roo Code

    Puppeteer-based browser tool, captures DOM and screenshots. Roo also supports Playwright-MCP.

    cline/cline
  • Aider (with Playwright)

    Switches to Chromium when aider install-playwright has been run.

  • Bingbot / Googlebot WRS

    Index-side renderer used by Bing Chat, Microsoft Copilot, GitHub Copilot Chat web grounding, and Gemini URL context. The render pipeline is the same, but the agent reads whatever was last indexed, not what is on your page right now.

    Vercel — The rise of the AI crawler

Never fetches the full page. Receives a ranked snippet — title, meta description, OG tags, first H1, the first ~200 characters of visible text, and any JSON-LD.

Real consumers
  • ChatGPT Search

    Bing-grounded snippets. The agent reads ranked results, not full pages.

  • Anthropic API web_search

    Brave-backed snippets. Full bodies require a separate web_fetch call.

    Anthropic web_search docs
  • Perplexity (declared bot)

    Snippet-first; full body fetched separately when needed.

    Perplexity crawlers
  • Cursor @web / @docs

    Chunked-and-embedded crawl. Returns embedding chunks rather than pages.

  • Phind, You.com, GitHub Copilot Chat

    Web grounding via Bing or comparable snippet APIs.

SECTION 02

What we check

Deterministic checks only. No LLMs in the scoring path. Every check is implemented in src/lib/checks/ and re-runnable by hand against your URL.

Discoverability

/llms.txt manifest, sitemap, /.well-known/ endpoints, link-header api-catalog

Content accessibility

auth gates, redirect behaviour, HTTP status codes, soft 404s

Page size & truncation

HTML byte cap, markdown byte cap, content start position

Content structure

heading hierarchy, code-fence validity, JSON validity, internal-link integrity

Rendering & extraction

client-side rendering, tabbed content serialization, image alt text

Metadata completeness

title, meta description, OG tags, JSON-LD, robots policy

SECTION 03

What we don't measure

We don't lint your prose. Whether your sentences are too long or your tone is too passive doesn't change whether an agent can read the page. Tools like Vale and alex exist for that. We also don't run Lighthouse audits, accessibility audits beyond what is directly relevant to agent extraction, or SEO checks beyond the metadata that snippet readers consume.

SECTION 04

The grade

Every scan can produce a 0-100 score and a letter grade. We deliberately demote it to a footer-level affordance: it exists for shareability, not as the headline. The headline of your scan is the one-sentence verdict, not a number. The number is in the agent-fix prompt because the prompt has to be self-contained when pasted elsewhere.

SECTION 05

Every check, in detail

One entry per check. Each scan-result FAIL/WARN row links here so you can ground the verdict in its rubric definition. Anchor URLs are stable — bookmark, share, or link from your own docs review.

Agent retrieval

21 checks

Whether coding agents (Claude Code, Cursor, Continue) can read this site.

.md URL variants

#markdown-url-support

Coding agents (Claude Code, Cursor, Continue) try the .md variant first to skip Turndown. If you only serve .html, they have to fall back to a lossy HTML-to-markdown conversion.

Typical fix

Configure your docs platform to serve pages at equivalent .md URLs (e.g. /docs/quickstart.md).
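To spot-check this by hand, you need the .md sibling of a page URL. The sketch below assumes the simplest convention (append .md, replacing a trailing .html or slash); the function name markdown_variant is hypothetical, and your docs platform may map URLs differently.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_variant(url: str) -> str:
    """Return the assumed .md sibling of a docs URL (convention is an assumption)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith(".md"):
        return url  # already a markdown URL
    if path.endswith(".html"):
        path = path[: -len(".html")]
    if not path:
        path = "/index"  # assumed name for the site root
    # Keep the query string, drop any fragment.
    return urlunsplit((parts.scheme, parts.netloc, path + ".md", parts.query, ""))
```

Fetching the returned URL and checking for a 200 with a markdown body is then a one-line request in any HTTP client.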

A2A agent card

#a2a-agent-card

Agent-to-Agent (A2A) discovery lets other agents find your service as a callable agent and learn its capabilities programmatically. Optional unless your product surfaces agent-like functionality.

Typical fix

If your product itself acts as an agent, publish /.well-known/agent-card.json.

Accept: text/markdown handling

#content-negotiation

Modern agent fetchers (Claude Code as of v2.1.105) send Accept: text/markdown. If your server ignores it and returns HTML anyway, you lose the markdown shortcut.

Typical fix

Return markdown when the request has Accept: text/markdown.
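On the server side, the decision comes down to reading the Accept header. This is a minimal sketch of that decision, not a full content-negotiation implementation: it ignores wildcards such as */*, and the function name prefers_markdown is hypothetical.

```python
def prefers_markdown(accept_header: str) -> bool:
    """True when text/markdown is listed with a q-value at least as high as text/html."""

    def quality(media_type: str) -> float:
        best = 0.0
        for item in accept_header.split(","):
            fields = [field.strip() for field in item.split(";")]
            if fields[0].lower() != media_type:
                continue
            q = 1.0  # default quality when no q parameter is given
            for param in fields[1:]:
                name, _, value = param.partition("=")
                if name.strip().lower() == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        q = 0.0
            best = max(best, q)
        return best

    markdown_q = quality("text/markdown")
    return markdown_q > 0 and markdown_q >= quality("text/html")
```

A handler would call this once per request and pick the markdown or HTML representation accordingly.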

Agent Skills discovery (/.well-known/agent-skills.json)

#well-known-agent-skills

Skills are reusable agent capabilities scoped to your product. The well-known endpoint is how agents find them without manual config.

Typical fix

Publish /.well-known/agent-skills.json so coding agents can discover task-specific skills.

Auth alternative access path

#auth-alternative-access

When auth is non-negotiable, give agents a path: a public mirror, an anonymous-read endpoint, or a documented API key the user can configure.

Typical fix

If auth is required, document the public alternative (anonymous read endpoint, mirror site).

Auth gate detection

#auth-gate-detection

Auth-gated docs are invisible to every agent. If a portion needs to be gated, expose an auth-free public version, or a subset that your robots rules allow to be indexed.

Typical fix

Ensure docs pages return 200 without requiring login cookies or tokens.

Cache validators (Last-Modified / ETag)

#cache-headers

Cache validators let agents skip the body when nothing changed (304 Not Modified). Without them, every fetch refetches the full response, which is wasteful for the agent's budget and your bandwidth.

Typical fix

Send Last-Modified or ETag headers on docs responses.
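The 304 exchange described above can be sketched as follows. The hashing scheme and the names make_etag and conditional_status are assumptions for illustration; most web servers and CDNs generate validators for you.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the response body (scheme is an assumption)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def conditional_status(body: bytes, if_none_match: Optional[str]) -> int:
    """Return 304 when the client's validator still matches the body, else 200."""
    if if_none_match is None:
        return 200
    current = make_etag(body)
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    # If-None-Match uses weak comparison, so a W/ prefix is ignored.
    normalized = [tag[2:] if tag.startswith("W/") else tag for tag in candidates]
    if "*" in candidates or current in normalized:
        return 304
    return 200
```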

HTTP status code correctness

#http-status-codes

Agents trust status codes. A page that returns 200 but says 'not found' wastes the agent's budget on dead content.

Typical fix

Return the right status code. Soft 404s (200 + 'page not found' body) confuse agents.
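A soft 404 can be spotted with a rough heuristic like the one below. The phrase list and the function name looks_like_soft_404 are assumptions for illustration, and a page that legitimately documents 404 errors would be a false positive.

```python
import re

# Assumed phrases; tune for your own error template.
NOT_FOUND_PHRASES = ("page not found", "404", "does not exist", "doesn't exist")


def looks_like_soft_404(status: int, html: str) -> bool:
    """Flag a 200 response whose title or first H1 reads like an error page."""
    if status != 200:
        return False  # a real error status is not a *soft* 404
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    h1 = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S)
    text = " ".join(m.group(1) for m in (title, h1) if m).lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```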

llms.txt discovery directive

#llms-txt-directive

Even if /llms.txt exists, agents only know to look for it if your homepage points to it (`<link rel="llms.txt" href="/llms.txt">` or an HTTP `Link` header).

Typical fix

Add an llms.txt directive (link or HTTP header) so agents can find it without guessing the path.

llms.txt format validity

#llms-txt-valid

Agents parse llms.txt as a structured manifest. Malformed lines cause the entire file to be discarded silently, undoing the discoverability win of having one.

Typical fix

Fix /llms.txt formatting. The file exists but has malformed entries that agents can't parse.
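As a rough illustration of what "malformed" can mean, the sketch below checks two things taken from the llmstxt.org format: the file opens with an H1 title, and list items under H2 sections are markdown links with optional notes. It is a simplified validator under those assumptions, and llms_txt_problems is a hypothetical name.

```python
import re

# "- [name](url)" optionally followed by ": notes"
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\([^)\s]+\)(: .+)?$")


def llms_txt_problems(text: str) -> list:
    """Return human-readable problems found in an llms.txt body."""
    problems = []
    lines = [line.rstrip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    if not non_empty or not non_empty[0].startswith("# "):
        problems.append("first non-empty line is not an H1 title")
    in_file_list = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("## "):
            in_file_list = True  # file lists live under H2 sections
        elif in_file_list and line.startswith("- ") and not LINK_ITEM.match(line):
            problems.append(f"line {number}: list item is not a markdown link")
    return problems
```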

llms.txt freshness

#llms-txt-freshness

A stale llms.txt that lists removed pages or misses new ones is worse than none, since agents trust the manifest and stop looking elsewhere. Wire generation into your build pipeline so it can never drift.

Typical fix

Regenerate /llms.txt on every docs deploy so it reflects the current page set.

llms.txt manifest

#llms-txt-exists

An /llms.txt file (per the llmstxt.org spec) gives coding agents a manifest of your documentation. Without it, agents like Claude Code and Cursor have no shortcut to your structured docs and must crawl the whole site to find anything.

Typical fix

Create /llms.txt following https://llmstxt.org, listing all doc pages in markdown format.

llms.txt size budget

#llms-txt-size

Tighter agent fetchers (MCP defaults to 5 KB, Cursor WebFetch to 28 KB, Claude Code truncates at 100 KB) only see the opening of an oversized llms.txt. Use the progressive-disclosure pattern: a small root file pointing to /docs/section/llms.txt files.

Typical fix

Keep /llms.txt under 50 KB. Split into nested section-level llms.txt files if it grows past that.
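To see how much of a given llms.txt each fetcher would keep, you can compare its byte size against the caps quoted above. This sketch assumes 1 KB means 1024 bytes and that truncation is a plain byte cut; the name visible_fraction is hypothetical.

```python
# Caps as quoted in this section; 1 KB = 1024 bytes is an assumption.
FETCH_CAPS_BYTES = {
    "MCP default": 5 * 1024,
    "Cursor WebFetch": 28 * 1024,
    "Claude Code": 100 * 1024,
}


def visible_fraction(text: str) -> dict:
    """Map each fetcher to the fraction of the file it would see before truncating."""
    size = len(text.encode("utf-8"))
    return {
        name: 1.0 if size <= cap else cap / size
        for name, cap in FETCH_CAPS_BYTES.items()
    }
```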

MCP server discovery (/.well-known/mcp.json)

#well-known-mcp-card

MCP-aware agents check /.well-known/mcp.json for capability metadata. Without it, even users who would benefit from your MCP server never get pointed to it.

Typical fix

Publish /.well-known/mcp.json so agents can discover your MCP server.

OAuth / OIDC discovery

#oauth-discovery

OAuth/OIDC discovery metadata lets coding agents programmatically obtain access tokens. Without it, the agent has to be hand-configured with your auth endpoints, and most won't bother.

Typical fix

Publish /.well-known/openid-configuration or /.well-known/oauth-authorization-server so agents can authenticate against your APIs.

OAuth protected resource metadata

#oauth-protected-resource

RFC 9728 metadata describing your protected resource. Coding agents read this to discover which OAuth issuer they need to talk to before calling your API.

Typical fix

Publish /.well-known/oauth-protected-resource so agents know which authorization servers issue tokens for your APIs.

Redirect behavior

#redirect-behavior

JS or meta-refresh redirects break for agents that don't render. Server-side 3xx redirects work for everyone.

Typical fix

Use 301/308 for permanent redirects, 302/307 for temporary. Avoid HTML meta-refresh.
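Meta-refresh redirects are easy to find in raw HTML. The following is a minimal sketch using only the standard library; the names MetaRefreshFinder and meta_refresh_targets are hypothetical, and JavaScript redirects are out of scope for it.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect redirect targets declared via <meta http-equiv="refresh">."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0; url=/new-path"; keep what follows the first "=".
        _, _, target = (attr.get("content") or "").partition("=")
        target = target.strip().strip("'\"")
        if target:
            self.targets.append(target)


def meta_refresh_targets(html: str) -> list:
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets
```

Any non-empty result is a redirect that a non-rendering agent will not follow.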

Server-side rendering

#rendering-strategy

Agents that don't run JavaScript (Claude Code WebFetch, Continue, Aider default) only see the initial HTML. CSR-only docs look empty to them.

Typical fix

Server-render or static-export your docs. Client-side rendering hides content from raw HTTP fetchers.
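One way to see what a raw HTTP fetcher sees is to measure the visible text in the initial HTML with script and style content ignored. This is a rough sketch under that assumption; VisibleText and visible_text_length are hypothetical names, and the threshold for "looks empty" is left to you.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text_length(html: str) -> int:
    """Length of whitespace-normalized visible text in the un-rendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(" ".join(parser.chunks).split()))
```

A CSR-only shell typically scores at or near zero here, while a server-rendered page scores in the thousands.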

Generative engine

8 checks

Whether answer engines (ChatGPT Search, Perplexity, You.com) surface this site.

AI bot rules in robots.txt

#ai-bot-rules

Without explicit User-agent rules, AI crawlers fall back to platform defaults that vary widely and may not match your intent. Naming each crawler explicitly puts you in control of who can crawl, who can train, and who can answer-engine-cite.

Typical fix

Add explicit User-agent stanzas in /robots.txt for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Google-Extended.
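Checking for explicit stanzas only requires reading the User-agent lines. A minimal sketch, assuming the crawler names listed above; crawlers_without_stanza is a hypothetical name.

```python
NAMED_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended")


def crawlers_without_stanza(robots_txt: str) -> list:
    """Return the named crawlers that have no explicit User-agent line."""
    declared = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            declared.add(value.strip().lower())
    return [bot for bot in NAMED_CRAWLERS if bot.lower() not in declared]
```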

API catalog discovery (/.well-known/api-catalog)

#well-known-api-catalog

An api-catalog points agents at your API definitions. Without it, agents either guess paths or skip your API entirely.

Typical fix

Publish /.well-known/api-catalog so agents can locate your OpenAPI/AsyncAPI specs.

Page metadata completeness

#metadata-completeness

Search-snippet readers (ChatGPT Search, Perplexity, You.com) only see your title and description. Missing metadata means missing citations.

Typical fix

Add `<title>`, `<meta name="description">`, and OG tags to every doc page.
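A quick way to audit a page for these three items is to parse the raw HTML. The sketch below uses only the standard library; MetadataCollector and missing_metadata are hypothetical names, and it treats any og: property as satisfying the OG requirement, which is a simplification.

```python
from html.parser import HTMLParser


class MetadataCollector(HTMLParser):
    """Record whether a title, a meta description, and any OG tag are present."""

    def __init__(self):
        super().__init__()
        self.found = {"title": False, "description": False, "og": False}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("content") or "").strip():
            if (attr.get("name") or "").lower() == "description":
                self.found["description"] = True
            if (attr.get("property") or "").lower().startswith("og:"):
                self.found["og"] = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.found["title"] = True


def missing_metadata(html: str) -> list:
    collector = MetadataCollector()
    collector.feed(html)
    return [name for name, present in collector.found.items() if not present]
```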

robots.txt AI crawler access

#robots-txt

If robots.txt blocks agent crawlers, you're invisible to ChatGPT, Perplexity, and Claude.ai search even if your docs are otherwise perfect.

Typical fix

Allow major AI crawlers (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot) in /robots.txt.
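Python's standard-library robots.txt parser can answer this directly. The sketch below assumes the crawler names listed above and a hypothetical helper name, blocked_crawlers; real crawlers may interpret edge cases differently from urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def blocked_crawlers(robots_txt: str, url: str) -> list:
    """Return the AI crawlers that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]
```

Run it against the body of your own /robots.txt and a representative docs URL; an empty list means none of the named crawlers is blocked for that URL.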

sitemap.xml

#sitemap

A sitemap is the canonical index for crawlers and answer engines. Without one, ChatGPT Search, Perplexity, and Google AI Overviews fall back to following links from the homepage and miss pages that aren't navigable from there.

Typical fix

Generate /sitemap.xml listing every public docs URL.

Structured data (JSON-LD / Schema.org)

#content-signals

Answer engines like Perplexity heavily favor structured-data-rich pages because the citation surface is more reliable.

Typical fix

Add JSON-LD or Schema.org structured data so answer engines can cite specific sections.

Web Bot Auth signing keys

#web-bot-auth

Web Bot Auth (built on RFC 9421, HTTP Message Signatures) lets origin servers cryptographically verify that an inbound request really comes from a declared bot rather than a spoofed user-agent string. The signing-keys directory is the discovery endpoint.

Typical fix

If you run a bot that signs requests, publish /.well-known/http-message-signatures-directory.

Context management

9 checks

Whether this site fits efficiently into an agent's context window per page.

Code block language tags

#markdown-code-fence-validity

Code blocks without language tags lose their type when converted to markdown, hurting both rendering and agent parsing of examples.

Typical fix

Wrap every code block in triple backticks with a language tag (```ts, ```python).
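Untagged fences can be found with a line scan over the markdown output. This is a simplified sketch that only handles backtick fences and assumes fences are not nested; untagged_fences is a hypothetical name.

```python
FENCE = "`" * 3  # built at runtime so this example nests cleanly inside markdown


def untagged_fences(markdown: str) -> list:
    """Return 1-based line numbers of opening fences that carry no language tag."""
    untagged = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True
            if not stripped[len(FENCE):].strip():
                untagged.append(number)
    return untagged
```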

Content start position

#content-start-position

Many agent extractors heuristically prioritize early content. Pages where the actual answer starts past the halfway mark often get truncated before reaching it.

Typical fix

Move the main content into the first 50% of the page. Push nav and announcements below.
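One rough way to estimate where content starts is to find the first main, article, or H1 element and express its offset as a fraction of the HTML length. The marker list and the name content_start_ratio are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed markers for "the main content begins here".
CONTENT_MARKER = re.compile(r"<(main|article|h1)\b", re.IGNORECASE)


def content_start_ratio(html: str) -> Optional[float]:
    """Offset of the first content marker as a fraction of the document, or None."""
    match = CONTENT_MARKER.search(html)
    if match is None:
        return None
    return match.start() / len(html)
```

A ratio above 0.5 means the content begins past the halfway mark of the raw HTML.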

Heading hierarchy

#heading-hierarchy

Agents use heading structure to chunk pages. Skipping levels (H1 → H4) or having multiple H1s makes chunking unreliable.

Typical fix

Use exactly one H1 per page, then H2 for sections, H3 for subsections. No level skips.
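The two rules above (one H1, no level skips) can be checked against markdown output as follows. This sketch only recognizes ATX-style headings (lines starting with #) and skips fenced code; heading_problems is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3


def heading_problems(markdown: str) -> list:
    """Report skipped heading levels and a missing or repeated H1."""
    problems = []
    previous = 0
    h1_count = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' comments inside code blocks
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level == 1:
            h1_count += 1
        if previous and level > previous + 1:
            problems.append(f"H{previous} -> H{level} skips a level")
        previous = level
    if h1_count != 1:
        problems.append(f"expected exactly one H1, found {h1_count}")
    return problems
```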

HTML page size budget

#page-size-html

Headless agent fetchers cap fetch size. A page that's mostly nav and JS exhausts the budget before reaching the content.

Typical fix

Reduce nav boilerplate, inline scripts, and repetitive markup. Aim for under 1MB HTML.

Image alt text coverage

#image-alt-coverage

Diagram images without alt text are invisible to non-vision agents. Even short alts ('Architecture: API → DB → Cache') help.

Typical fix

Add alt text to every doc image. Agents and screen readers depend on it.
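Alt coverage is a simple ratio over the img tags in the page. In this sketch an empty alt attribute counts as missing, which is an assumption (empty alts are legitimate for purely decorative images); ImageAltCounter and alt_coverage are hypothetical names.

```python
from html.parser import HTMLParser


class ImageAltCounter(HTMLParser):
    """Count <img> tags and how many of them carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.with_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.total += 1
            if (dict(attrs).get("alt") or "").strip():
                self.with_alt += 1


def alt_coverage(html: str) -> float:
    """Fraction of images with alt text; 1.0 when the page has no images."""
    counter = ImageAltCounter()
    counter.feed(html)
    return 1.0 if counter.total == 0 else counter.with_alt / counter.total
```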

JSON code block validity

#json-code-block-validity

Agents often try to JSON.parse() example payloads. Invalid samples cause silent failures or fallback to text-only answers.

Typical fix

Make sure JSON code blocks are valid JSON. Trailing commas and comments break agent parsing.
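You can reproduce the failure an agent would hit by extracting each json-tagged block and parsing it strictly. A minimal sketch, assuming backtick fences tagged exactly json; invalid_json_blocks is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly inside markdown
JSON_BLOCK = re.compile(FENCE + r"json[ \t]*\n(.*?)\n?" + FENCE, re.S)


def invalid_json_blocks(markdown: str) -> list:
    """Return (block_index, parser_message) for each json block that fails to parse."""
    errors = []
    for index, match in enumerate(JSON_BLOCK.finditer(markdown), start=1):
        try:
            json.loads(match.group(1))
        except json.JSONDecodeError as exc:
            errors.append((index, exc.msg))
    return errors
```

Trailing commas and comments both raise here, matching the strict behaviour of JSON.parse().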

Markdown page size budget

#page-size-markdown

Claude Code's pipeline truncates at 100KB of markdown before any further processing. Long pages get cut mid-content.

Typical fix

Trim the markdown output. Claude Code WebFetch caps at 100KB of markdown, and anything past that is lost.

Tabbed content serialization

#tabbed-content-serialization

JS-only tab widgets are invisible to raw HTTP fetchers. Either render all tab content in the source HTML, or serve a single linear version for non-JS readers.

Typical fix

Render every tab variant in HTML, not via JS. Agents without browsers see only the first tab.

SECTION 06 · OPEN SOURCE

The whole scoring path is open. Re-run any number we show you.

$ npx afdocs check <url> --fixes --verbose