Overview

A modern public site ships about a dozen small files at well-known paths that crawlers, agents, and tooling expect to find without negotiation. Get them wrong and you lose indexing, AI-training opt-out, instant-crawl pings, search-engine verification, and accessibility metadata. This page is the master catalog: every file, its required location, its purpose, the rules for the body, and a link to the deep-dive. Use it as the pre-launch checklist and the post-launch audit list.

The catalog

| File | Path | Purpose | Required if |
| --- | --- | --- | --- |
| robots.txt | /robots.txt | Crawl scope and sitemap pointer | Always |
| sitemap.xml | /sitemap.xml | Enumerate every canonical URL with lastmod | Always |
| llms.txt | /llms.txt | Agent-facing index of canonical pages | LLM/agent traffic matters |
| llms-full.txt | /llms-full.txt | Full text dump of priority pages, for agent ingestion | Vault sites and reference docs |
| ai.txt | /ai.txt | AI training opt-out declaration | You care about training-set inclusion |
| IndexNow key | /<32-char-key>.txt | Proves ownership for IndexNow pings | Using IndexNow on Bing or Yandex |
| security.txt | /.well-known/security.txt | Vulnerability disclosure contact (RFC 9116) | Always for production sites |
| humans.txt | /humans.txt | Credit the people behind the site | Optional, low-cost |
| manifest.json | /manifest.json or /manifest.webmanifest | PWA install metadata, icons, theme | Mobile traffic matters |
| favicon.ico | /favicon.ico | Browser tab icon, SERP favicon | Always |
| OG default image | /og-default.png (1200×630) | Fallback social card | Always |
| Apple touch icon | /apple-touch-icon.png (180×180) | iOS home-screen icon | iOS traffic matters |

Rules per file

robots.txt

Minimum body:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

One Sitemap: line per sitemap. Only Disallow: paths that genuinely should not be crawled. Referencing llms.txt from robots.txt in a comment is a common pattern: # llms.txt: https://example.com/llms.txt. Deep dive: technical.
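A quick way to confirm that a robots.txt body behaves as intended is to parse it with Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: `check_robots` and the probe URL are hypothetical names, and `site_maps()` assumes Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_BODY = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

def check_robots(body: str, probe_url: str) -> tuple[bool, list[str]]:
    """Return (whether a generic crawler may fetch probe_url, declared sitemap URLs)."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    # site_maps() returns None when the body has no Sitemap: line
    sitemaps = parser.site_maps() or []
    return parser.can_fetch("*", probe_url), sitemaps

allowed, sitemaps = check_robots(ROBOTS_BODY, "https://example.com/docs/")
print(allowed, sitemaps)
```

Running the same check against a path you meant to block is a cheap regression test for accidental Allow or Disallow rules.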

sitemap.xml

Every canonical URL with a real <lastmod> taken from the page’s last_updated field. Regenerate on build; never hand-edit. Split into multiple sitemaps with a sitemap index when a file would exceed 50,000 URLs or 50 MB uncompressed, the per-file limits of the sitemap protocol. Submit once in Google Search Console and Bing Webmaster Tools; engines refetch on their own. Pair with indexnow for sub-hour change notification.
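The build step described above can be sketched with `xml.etree.ElementTree`. This is a minimal illustration, not the site's actual generator: `build_sitemaps`, the `sitemap-N.xml` naming scheme, and the `(url, last_updated)` input shape are assumptions. It returns XML strings without an XML declaration, so a real writer would prepend one when saving.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit named in the text above

def build_sitemaps(pages, base="https://example.com", max_urls=MAX_URLS):
    """pages: list of (canonical_url, last_updated) pairs, dates as ISO strings.

    Returns {filename: xml_string}. A sitemap index is produced only when
    the URL list has to be split across several files.
    """
    chunks = [pages[i:i + max_urls] for i in range(0, len(pages), max_urls)] or [[]]
    files = {}
    for n, chunk in enumerate(chunks, start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc, lastmod in chunk:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = loc
            ET.SubElement(url, "lastmod").text = lastmod
        # hypothetical naming: one file keeps the canonical name, many get a suffix
        name = "sitemap.xml" if len(chunks) == 1 else f"sitemap-{n}.xml"
        files[name] = ET.tostring(urlset, encoding="unicode")
    if len(chunks) > 1:
        index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
        for name in list(files):
            entry = ET.SubElement(index, "sitemap")
            ET.SubElement(entry, "loc").text = f"{base}/{name}"
        files["sitemap.xml"] = ET.tostring(index, encoding="unicode")
    return files
```

Keeping /sitemap.xml as the index means the Sitemap: line in robots.txt never has to change when the site grows past one file.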

llms.txt

Agent-facing markdown index of the site’s canonical pages, grouped by section, each with a one-sentence summary. Lives at /llms.txt, served as text/plain or text/markdown. Deep dive and authoring rules: llms-txt. End-to-end howto: ship-llms-txt. Auditing an existing one: audit-llms-txt-with-claude.
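As a rough illustration of the grouping described above, the following sketch renders an index from page metadata. The function name and the `section`/`title`/`url`/`summary` keys are assumptions for this example; the layout (an H1 title, a blockquote summary, then one H2 per section with a link list) follows the commonly used llms.txt proposal, and the deep-dive page remains the authority on authoring rules.

```python
def build_llms_txt(site_title, site_summary, pages):
    """pages: list of dicts with 'section', 'title', 'url', and 'summary' keys."""
    sections = {}
    for page in pages:
        # dicts preserve insertion order, so sections appear in first-seen order
        sections.setdefault(page["section"], []).append(page)

    lines = [f"# {site_title}", "", f"> {site_summary}", ""]
    for section, items in sections.items():
        lines.append(f"## {section}")
        lines.append("")
        for p in items:
            lines.append(f"- [{p['title']}]({p['url']}): {p['summary']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Driving this from the same frontmatter that feeds the sitemap keeps the two files from drifting apart.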

llms-full.txt

The full text content of the priority pages, concatenated, served at /llms-full.txt. Designed for agents that want to ingest the whole vault in one fetch rather than crawl page by page. Cap at ~2-3 MB to stay friendly to context windows. Generate at build from the same frontmatter that drives /llms.txt.
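One way to enforce the size cap at build time is to concatenate pages in priority order and skip any page that would push the file past the limit. The sketch below assumes a hypothetical `(title, url, body_text)` input and a simple per-page header; neither is prescribed by the text above.

```python
MAX_BYTES = 3 * 1024 * 1024  # upper end of the ~2-3 MB cap suggested above

def build_llms_full(pages, max_bytes=MAX_BYTES):
    """pages: list of (title, url, body_text), already sorted by priority.

    Skips any page that would push the total past max_bytes and
    returns (document, skipped_urls) so the build can report what was left out.
    """
    parts, skipped, size = [], [], 0
    for title, url, body in pages:
        block = f"# {title}\nSource: {url}\n\n{body.strip()}\n\n"
        block_size = len(block.encode("utf-8"))  # measure bytes, not characters
        if size + block_size > max_bytes:
            skipped.append(url)
            continue
        parts.append(block)
        size += block_size
    return "".join(parts), skipped
```

Logging the skipped URLs during the build makes it obvious when the priority list has outgrown the cap.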

ai.txt

Declares the site’s stance on AI training-set inclusion. Spawning’s proposed format covers Disallow: and Allow: directives per crawler. Deep dive: ai-txt. Cross-link from robots.txt as a comment.

IndexNow key file

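A flat text file at /<key>.txt whose body is the key itself, proving ownership. The IndexNow protocol allows keys of 8 to 128 characters drawn from a-z, A-Z, 0-9, and dashes; generated keys are commonly 32 characters long. Required before Bing or Yandex will accept IndexNow pings on the domain. Deep dive: indexnow.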
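The ownership check and the ping body can be sketched without touching the network. In this illustration `key_file_ok` and `build_ping` are hypothetical helper names; the JSON field names (`host`, `key`, `keyLocation`, `urlList`) and the key constraints follow the published IndexNow protocol, and the payload would be POSTed to an IndexNow endpoint such as https://api.indexnow.org/indexnow.

```python
import json
import re

# Character set and length range allowed for keys by the IndexNow protocol
KEY_PATTERN = re.compile(r"[A-Za-z0-9-]{8,128}")

def key_file_ok(key: str, served_body: str) -> bool:
    """True when the key is well-formed and the served file body matches it."""
    return bool(KEY_PATTERN.fullmatch(key)) and served_body.strip() == key

def build_ping(host: str, key: str, urls: list[str]) -> str:
    """JSON body for a POST to an IndexNow endpoint."""
    return json.dumps({
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    })
```

Checking the served body before the first ping avoids the most common failure: a key file that exists but was rewritten by a templating step.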

security.txt

Per RFC 9116, place at /.well-known/security.txt. Required fields: Contact:, Expires:. Optional: Preferred-Languages:, Canonical:, Acknowledgments:. Refresh annually so Expires: stays in the future.

Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00Z
Preferred-Languages: en
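Because an expired file is easy to miss, a small audit function can check the two required fields. This is a sketch under stated assumptions: `check_security_txt` is a hypothetical name, the caller passes a timezone-aware `now`, and signature blocks and the other optional fields are ignored.

```python
from datetime import datetime, timezone

def check_security_txt(body: str, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the file passes this sketch."""
    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        # RFC 9116 field names are case-insensitive
        fields.setdefault(name.strip().lower(), []).append(value.strip())

    problems = []
    if "contact" not in fields:
        problems.append("missing Contact")
    expires = fields.get("expires", [])
    if len(expires) != 1:
        problems.append("need exactly one Expires")
    else:
        # fromisoformat() accepts a trailing "Z" only from Python 3.11, so normalise it
        when = datetime.fromisoformat(expires[0].replace("Z", "+00:00"))
        if when <= now:
            problems.append("Expires is in the past")
    return problems
```

Calling it with `datetime.now(timezone.utc)` in a scheduled job gives an early warning before the Expires: date lapses.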

humans.txt

Credit the people, tools, and stack behind the site. Free-form text. No standard schema; the convention is Team:, Tools:, Thanks: sections.

manifest.json

PWA metadata: app name, icons, theme color, display mode, start URL. Reference from HTML: <link rel="manifest" href="/manifest.webmanifest">. Validate with the Application panel in Chrome DevTools.
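DevTools validation is manual, so a build-time check for the members listed above can catch omissions earlier. The sketch below is an assumption-laden minimum: `missing_manifest_keys` is a hypothetical helper, and it only tests that the standard manifest members `name`, `icons`, `theme_color`, `display`, and `start_url` are present and non-empty.

```python
import json

# Manifest members corresponding to the metadata listed in the text above
EXPECTED_KEYS = ("name", "icons", "theme_color", "display", "start_url")

def missing_manifest_keys(manifest_text: str) -> list[str]:
    """Return the expected members that are absent or empty in the manifest."""
    manifest = json.loads(manifest_text)
    return [key for key in EXPECTED_KEYS if not manifest.get(key)]
```

An empty result does not prove the manifest is installable; it only shows that the basics are filled in before the DevTools pass.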

favicon.ico, OG image, Apple touch icon

Static binary assets at fixed paths. Reference each from the HTML <head>:

<link rel="icon" href="/favicon.ico">
<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">
<meta property="og:image" content="https://example.com/og-default.png">

Per-page OG images override the default; see og-images for the dynamic-generation pattern.
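To audit that a rendered page actually carries these references, the standard-library `html.parser` is enough. The class below is a sketch: `HeadAssets` is a hypothetical name, and it only recognises the exact `rel` and `property` values shown in the snippet above.

```python
from html.parser import HTMLParser

class HeadAssets(HTMLParser):
    """Collects favicon, Apple touch icon, and og:image references from HTML."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") in ("icon", "apple-touch-icon"):
            self.found[a["rel"]] = a.get("href")
        elif tag == "meta" and a.get("property") == "og:image":
            self.found["og:image"] = a.get("content")

def head_assets(html: str) -> dict:
    parser = HeadAssets()
    parser.feed(html)
    return parser.found
```

Comparing the returned dictionary against the three expected keys flags templates that dropped a tag.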

Pre-launch checklist

Run this once before flipping DNS to production:

# Point HOST at the environment you are about to launch.
HOST="https://example.com"
for path in /robots.txt /sitemap.xml /llms.txt /llms-full.txt /ai.txt \
            /.well-known/security.txt /humans.txt /manifest.webmanifest \
            /favicon.ico /og-default.png /apple-touch-icon.png; do
  # -s hides progress, -o discards the body, -w prints only the HTTP status code
  status=$(curl -s -o /dev/null -w "%{http_code}" "$HOST$path")
  echo "$status  $path"
done

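Every path should return 200. Investigate any non-200 and fix it before the launch announcement. The IndexNow key file is not in the list because its name is site-specific; if you use IndexNow, add /<key>.txt to the loop. End-to-end build wiring for a Quartz site: static-site-seo.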