Overview
An llms.txt file tells AI crawlers what your site contains, which pages to prioritize, and how they relate. Like a sitemap, it drifts from the real site over time: pages are added without updating llms.txt, URLs change, descriptions go stale. This guide uses Claude to validate the format, flag missing pages, and surface stale descriptions. The spec lives in llms-txt.
Prerequisites
- A site with a live /llms.txt. If not, follow ship-llms-txt first. Confirm with curl on the path.
- An Anthropic API key, or Claude Code open in the project.
- Optional: a list of all current page slugs (a sitemap XML or a directory listing).
Steps
1. Fetch the current llms.txt
curl -s https://example.com/llms.txt > llms-current.txt
wc -l llms-current.txt  # sanity check: expect at least 10 lines
If the request returns a 404, the file is missing or misconfigured. See add-llms-txt-to-existing-site for the setup steps.
2. Gather the list of actual pages
For a static site, list the content files:
find content/ -name "*.md" ! -name "index.md" | sort > pages-actual.txt
For a deployed site with a sitemap:
curl -s https://example.com/sitemap.xml \
| grep -oP '(?<=<loc>)[^<]+' \
  | sort > pages-actual.txt
The diff between pages-actual.txt and llms-current.txt identifies missing or stale entries.
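Before prompting, it helps to normalize both lists to a common key, since absolute URLs, file paths, and .md filenames never match line-for-line. A minimal sketch, assuming the two file names from this step (the slug helper and diff_lists are ours, not from any library):

```python
import re

def slug(s):
    """Reduce a URL or file path to its final path segment, minus any .md suffix."""
    return s.rstrip("/").rsplit("/", 1)[-1].removesuffix(".md")

def diff_lists(actual_text, llms_text):
    """Return (pages missing from llms.txt, llms.txt entries with no live page)."""
    actual = {slug(line.strip()) for line in actual_text.splitlines() if line.strip()}
    listed = {slug(url) for url in re.findall(r"\]\((\S+?)\)", llms_text)}
    return sorted(actual - listed), sorted(listed - actual)

# Demo with inline data; in practice read pages-actual.txt and llms-current.txt.
missing, stale = diff_lists(
    "content/getting-started.md\ncontent/pricing.md\n",
    "- [Getting started](https://example.com/getting-started): Setup steps.\n"
    "- [Old page](https://example.com/old-page): Removed last year.\n",
)
print(missing)  # ['pricing']
print(stale)    # ['old-page']
```

Passing the normalized diff to Claude alongside the raw files makes the missing_pages output far less dependent on the model matching URL formats itself.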
3. Build the validation prompt
A precise prompt produces actionable output. Use a system prompt that specifies the expected format, then pass the file as context.
import anthropic
client = anthropic.Anthropic()
with open("llms-current.txt") as f:
    llms_content = f.read()
with open("pages-actual.txt") as f:
    pages = f.read()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="""You are an llms.txt format validator.
The llms.txt spec requires:
- A top-level H1 with the site name.
- One or more H2 sections grouping related pages.
- Each page listed as: - [Title](URL): one-sentence description.
- Descriptions must be 10-30 words and serve as agent-routing hints (what will I find here?).
- No duplicate URLs.
- All URLs must be absolute and end without a trailing slash.
Return a JSON object with keys:
format_errors: list of format violations (string each)
missing_pages: list of pages in the actual-pages list not found in llms.txt
stale_descriptions: list of entries whose description is vague or empty
summary: one sentence
""",
    messages=[{
        "role": "user",
        "content": f"<llms_txt>\n{llms_content}\n</llms_txt>\n\n<actual_pages>\n{pages}\n</actual_pages>"
    }]
)
print(response.content[0].text)
See prompt-design for structuring validation prompts. The system prompt here follows the system-prompts pattern of specifying output format explicitly.
4. Review and apply fixes
Parse the JSON output and work through each category:
- Format errors: fix the llms.txt file directly. Common issues: descriptions over one sentence, missing H1, relative URLs.
- Missing pages: add an entry for each missing page with a routing-hint description. A description should answer "what will an agent find on this page?".
- Stale descriptions: rewrite any description that is vague ("General info about X") or empty.
Apply fixes to your local content/llms.txt or wherever the file is generated.
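The review pass above can be scripted. A sketch, assuming the model returned bare JSON as instructed in step 3 (summarize is a hypothetical helper, not part of any API):

```python
import json

def summarize(raw):
    """Print findings per category and return the total finding count."""
    report = json.loads(raw)
    total = 0
    for key in ("format_errors", "missing_pages", "stale_descriptions"):
        items = report.get(key, [])
        total += len(items)
        print(f"{key}: {len(items)} finding(s)")
        for item in items:
            print(f"  - {item}")
    print(report.get("summary", ""))
    return total

# With the response from step 3:
# summarize(response.content[0].text)
```

A non-zero return value is a convenient exit-code signal if you later wrap this in a pre-commit hook.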
5. Re-run Claude to confirm zero findings
After applying fixes, re-validate:
# If the site is local, build first
npx quartz build
curl -s http://localhost:8080/llms.txt > llms-fixed.txt
Run the same script against llms-fixed.txt. The JSON output should have empty format_errors, missing_pages, and stale_descriptions arrays.
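The semantic checks still need the Claude call, but the mechanical rules from the step-3 system prompt (H1 present, no duplicate URLs, absolute URLs, no trailing slash) can be confirmed locally before spending tokens. A sketch of such a pre-check, with the rules-as-code and function name being our assumptions:

```python
import re

def format_errors(text):
    """Mechanical llms.txt checks mirroring the step-3 system prompt."""
    errors = []
    if not re.search(r"^# \S", text, re.MULTILINE):
        errors.append("no top-level H1 with the site name")
    urls = re.findall(r"^- \[[^\]]+\]\((\S+?)\)", text, re.MULTILINE)
    seen = set()
    for url in urls:
        if url in seen:
            errors.append(f"duplicate URL: {url}")
        seen.add(url)
        if not url.startswith(("http://", "https://")):
            errors.append(f"relative URL: {url}")
        if url.endswith("/"):
            errors.append(f"trailing slash: {url}")
    return errors
```

If this returns a non-empty list, fix those entries first; the Claude run can then focus on the judgment calls (missing pages, stale descriptions).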
6. Add to CI to catch drift
Wire a lightweight check in CI that counts entries versus pages:
#!/bin/bash
PAGES=$(find content/ -name "*.md" ! -name "index.md" | wc -l)
ENTRIES=$(grep -c "^- \[" content/llms.txt || echo 0)
if [ "$ENTRIES" -lt "$PAGES" ]; then
echo "llms.txt has $ENTRIES entries but $PAGES pages exist. Update llms.txt."
exit 1
fi
This does not replace the Claude audit but catches obvious drift automatically. See claude-code-workflow for how to wire hooks.
Verify it worked
# No format errors
python validate_llms.py llms-fixed.txt
# Expected: { "format_errors": [], "missing_pages": [], ... }
# File is reachable
curl -si https://example.com/llms.txt | head -3
# HTTP/2 200
# content-type: text/plain
# Entries count matches pages
grep -c "^- \[" content/llms.txt
Common errors
- Claude returns hallucinated page names. The model is generating pages that do not exist in the actual-pages list. Reduce max_tokens and constrain the system prompt to "only reference pages in the provided list."
- JSON output is malformed. The model included Markdown around the JSON block. Post-process with jq or ask Claude to return only the raw JSON with no code fences.
- Missing pages list is empty despite obvious gaps. The pages-actual.txt format does not match the URL format in llms.txt. Normalize both to the same URL pattern before running the prompt.
- CI count check has false positives. Index files and draft pages inflate find output. Add ! -name exclusions for index.md and filter out pages marked status: draft in frontmatter.
- Descriptions pass validation but are poor quality. Add a quality_issues key to the prompt output that flags descriptions that are tautological ("The index page is the index of the site") or under 10 words.
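The malformed-JSON case above can also be handled in Python rather than jq. A minimal sketch (parse_json_reply is a hypothetical helper name):

```python
import json
import re

def parse_json_reply(text):
    """Drop a surrounding ```json code fence if present, then parse as JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    return json.loads(match.group(1) if match else text)
```

This keeps the validation script tolerant of fenced replies without loosening the "raw JSON only" instruction in the prompt.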