Overview

An llms.txt file tells AI crawlers what your site contains, which pages to prioritize, and how they relate. Like a sitemap, it drifts from the real site over time: pages are added without updating llms.txt, URLs change, descriptions go stale. This guide uses Claude to validate the format, flag missing pages, and surface stale descriptions. The spec lives in llms-txt.

Prerequisites

  • A site with a live /llms.txt. If you do not have one yet, follow ship-llms-txt first.
  • curl available on your PATH.
  • An Anthropic API key, or Claude Code open in the project.
  • Optional: a list of all current page slugs (a sitemap XML or a directory listing).

Steps

1. Fetch the current llms.txt

curl -fsS https://example.com/llms.txt > llms-current.txt   # -f: exit non-zero on HTTP errors such as 404
wc -l llms-current.txt   # sanity check: expect at least 10 lines

If the request returns a 404, the file is missing or the server is misconfigured. See add-llms-txt-to-existing-site for the setup steps.

2. Gather the list of actual pages

For a static site, list the content files:

find content/ -name "*.md" ! -name "index.md" | sort > pages-actual.txt

For a deployed site with a sitemap (the -P option requires GNU grep; the BSD grep shipped with macOS does not support it):

curl -s https://example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  | sort > pages-actual.txt

Comparing the URLs listed in llms-current.txt against pages-actual.txt identifies pages that are missing from llms.txt and entries that point to pages that no longer exist.
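The comparison needs a little preprocessing, because llms.txt holds Markdown list entries while pages-actual.txt holds bare paths or URLs. Below is a minimal sketch, assuming pages-actual.txt already contains one absolute URL per line. The helper names (normalize, extract_urls, compare) and the sample data are illustrative and not part of any spec.

```python
import re

# Captures the URL of a Markdown list entry such as
# "- [Title](https://example.com/page): description".
ENTRY_URL = re.compile(r"^\s*-\s*\[[^\]]*\]\(([^)\s]+)\)", re.MULTILINE)


def normalize(url: str) -> str:
    """Strip whitespace and a trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")


def extract_urls(llms_text: str) -> set[str]:
    """Collect the normalized URL of every list entry in an llms.txt body."""
    return {normalize(u) for u in ENTRY_URL.findall(llms_text)}


def compare(llms_text: str, actual_pages: str) -> tuple[set[str], set[str]]:
    """Return (missing, orphaned): pages absent from llms.txt, entries with no page."""
    listed = extract_urls(llms_text)
    actual = {normalize(line) for line in actual_pages.splitlines() if line.strip()}
    return actual - listed, listed - actual


# Hypothetical sample data standing in for the two files.
sample_llms = """# Example
## Guides
- [Setup](https://example.com/setup): How to install and configure the tool.
- [Old page](https://example.com/old/): A page that has since been removed.
"""
sample_pages = "https://example.com/setup\nhttps://example.com/faq\n"

missing, orphaned = compare(sample_llms, sample_pages)
print("missing:", sorted(missing))
print("orphaned:", sorted(orphaned))
```

If pages-actual.txt was produced by the find command instead, the file paths would first need mapping to URLs. That mapping depends on the site generator and is left out here.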

3. Build the validation prompt

A precise prompt produces actionable output. Use a system prompt that specifies the expected format, then pass both files (the current llms.txt and the list of actual pages) as context.

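# validate_llms.py
import sys

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# File to validate: first CLI argument, defaulting to the copy fetched in step 1.
llms_path = sys.argv[1] if len(sys.argv) > 1 else "llms-current.txt"

with open(llms_path) as f:
    llms_content = f.read()

with open("pages-actual.txt") as f:
    pages = f.read()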
 
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="""You are an llms.txt format validator.
The llms.txt spec requires:
- A top-level H1 with the site name.
- One or more H2 sections grouping related pages.
- Each page listed as: - [Title](URL): one-sentence description.
- Descriptions must be 10-30 words and written as agent-routing hints (what will I find here?).
- No duplicate URLs.
- All URLs must be absolute and end without a trailing slash.
 
Return a JSON object with keys:
  format_errors: list of format violations (string each)
  missing_pages: list of pages in the actual-pages list not found in llms.txt
  stale_descriptions: list of entries whose description is vague or empty
  summary: one sentence
""",
    messages=[{
        "role": "user",
        "content": f"<llms_txt>\n{llms_content}\n</llms_txt>\n\n<actual_pages>\n{pages}\n</actual_pages>"
    }]
)
 
print(response.content[0].text)

See prompt-design for structuring validation prompts. The system prompt here follows the system-prompts pattern of specifying output format explicitly.
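Several rules in the system prompt are mechanical, so they can also be checked locally without a model call. That gives a baseline to compare Claude's format_errors against. The sketch below assumes the entry format stated in the prompt. The function name check_entries and the message wording are hypothetical.

```python
import re
from collections import Counter

# "- [Title](URL): description" with the description part optional.
ENTRY = re.compile(r"^-\s*\[([^\]]*)\]\(([^)\s]+)\)(?::\s*(.*))?$")


def check_entries(llms_text: str) -> list[str]:
    """Return human-readable violations of the mechanical rules from the prompt."""
    errors = []
    lines = llms_text.splitlines()
    if not any(line.startswith("# ") for line in lines):
        errors.append("missing top-level H1")
    urls = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m:
            continue  # headings, blank lines, and prose are skipped
        _title, url, desc = m.groups()
        urls.append(url)
        if not url.startswith(("http://", "https://")):
            errors.append(f"relative URL: {url}")
        if url.endswith("/"):
            errors.append(f"trailing slash: {url}")
        words = len((desc or "").split())
        if not 10 <= words <= 30:
            errors.append(f"description has {words} words (expected 10-30): {url}")
    for url, count in Counter(urls).items():
        if count > 1:
            errors.append(f"duplicate URL ({count} times): {url}")
    return errors


# Hypothetical input with several violations at once.
sample = "## Guides\n- [Setup](/setup/): Install it.\n- [Setup](/setup/): Install it.\n"
for error in check_entries(sample):
    print(error)
```

Judging whether a description is vague or stale still needs the model. This only covers the rules that a regular expression can decide.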

4. Review and apply fixes

Parse the JSON output and work through each category:

  • Format errors: fix the llms.txt file directly. Common issues: descriptions over one sentence, missing H1, relative URLs.
  • Missing pages: add an entry for each missing page with a routing-hint description. A description should answer “what will an agent find on this page?”
  • Stale descriptions: rewrite any description that is vague (“General info about X”) or empty.

Apply fixes to your local content/llms.txt or wherever the file is generated.
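The parsing step can be sketched as follows. This is a minimal example, assuming the response text is the JSON object requested in step 3, possibly wrapped in a Markdown code fence. The name parse_findings and the sample reply are illustrative.

```python
import json
import re

CATEGORIES = ("format_errors", "missing_pages", "stale_descriptions")

FENCE = "`" * 3  # a Markdown code fence marker
FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}$", re.DOTALL)


def parse_findings(raw: str) -> dict:
    """Parse the model's reply, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    fenced = FENCED.match(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError if still malformed
    # Treat a missing key as "no findings" so callers never hit a KeyError.
    findings = {key: data.get(key, []) for key in CATEGORIES}
    findings["summary"] = data.get("summary", "")
    return findings


# Hypothetical reply, fenced the way models sometimes return JSON.
reply = (
    FENCE + "json\n"
    '{"format_errors": ["missing H1"], "missing_pages": [], "summary": "One issue."}\n'
    + FENCE
)
findings = parse_findings(reply)
for key in CATEGORIES:
    print(f"{key}: {len(findings[key])}")
    for item in findings[key]:
        print(f"  - {item}")
```

In the step 3 script, the input would be response.content[0].text rather than the sample string.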

5. Re-run Claude to confirm zero findings

After applying fixes, re-validate:

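# If the site is local, build and serve it first (run this in a separate terminal)
npx quartz build --serve
# Then fetch the rebuilt file from the local server
curl -fsS http://localhost:8080/llms.txt > llms-fixed.txt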

Point the script from step 3 at llms-fixed.txt and run it again. The JSON output should have empty format_errors, missing_pages, and stale_descriptions arrays.

6. Add to CI to catch drift

Wire a lightweight check into CI that compares the number of entries with the number of pages:

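#!/bin/bash
# Fail when llms.txt lists fewer entries than there are content pages.
PAGES=$(find content/ -name "*.md" ! -name "index.md" | wc -l)
# grep -c already prints 0 when nothing matches (and exits non-zero),
# so swallow the exit status instead of echoing a second 0.
ENTRIES=$(grep -c '^- \[' content/llms.txt || true)
if [ "${ENTRIES:-0}" -lt "$PAGES" ]; then
  echo "llms.txt has ${ENTRIES:-0} entries but $PAGES pages exist. Update llms.txt."
  exit 1
fi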

This does not replace the Claude audit but catches obvious drift automatically. See claude-code-workflow for how to wire hooks.

Verify it worked

# No format errors
python validate_llms.py llms-fixed.txt
# Expected: { "format_errors": [], "missing_pages": [], ... }
 
# File is reachable (print only the status line and content type)
curl -sI https://example.com/llms.txt | grep -iE '^(HTTP/|content-type:)'
# HTTP/2 200
# content-type: text/plain
 
# Entry count matches page count (the two numbers should agree)
grep -c "^- \[" content/llms.txt
find content/ -name "*.md" ! -name "index.md" | wc -l

Common errors

  • Claude returns hallucinated page names. The model is generating pages that do not exist in the actual-pages list. Constrain the system prompt with “only reference pages in the provided list,” and discard any reported page that does not appear in pages-actual.txt.
  • JSON output is malformed. The model wrapped the JSON in Markdown code fences or added prose around it. Strip the fences before parsing, or ask Claude to return only the raw JSON with no code fences.
  • Missing pages list is empty despite obvious gaps. The pages-actual.txt format does not match the URL format in llms.txt. Normalize both to the same URL pattern before running the prompt.
  • CI count check has false positives. Index files and draft pages inflate the find output. find has no --exclude flag, so add further ! -name or ! -path tests for the files to skip, and filter out pages whose frontmatter contains status: draft.
  • Descriptions pass validation but are poor quality. Add a quality_issues key to the prompt output that flags descriptions that are tautological (“The index page is the index of the site”) or under 10 words.
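For the CI false-positive case above, draft filtering could look like the sketch below. It assumes each page starts with a frontmatter block delimited by --- lines and marks drafts with a status: draft line. It reads the frontmatter line by line rather than with a YAML parser, so nested or multi-line values are not handled. The function names are hypothetical.

```python
from pathlib import Path


def is_draft(markdown_text: str) -> bool:
    """True when the leading frontmatter block contains a 'status: draft' line."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False  # no frontmatter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, _, value = line.partition(":")
        if key.strip() == "status" and value.strip().strip("\"'") == "draft":
            return True
    return False


def count_published_pages(root: str) -> int:
    """Count .md files under root, skipping index.md files and drafts."""
    return sum(
        1
        for path in Path(root).rglob("*.md")
        if path.name != "index.md"
        and not is_draft(path.read_text(encoding="utf-8"))
    )
```

A CI job could then compare count_published_pages("content") with the number of entries in llms.txt, in place of the raw find count.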