Overview

Match the bot to the behavior before you allow or block it. AI crawlers split into three jobs: training ingestion, search indexing for an AI answer engine, and live user-triggered fetches. Blocking one does not block the others, and the user-agent strings change as vendors split their fleets. This card lists the agents that matter as of 2026 and the exact robots.txt token for each. For where these tokens live and how they interact with llms.txt and ai.txt, see discoverability-files.

Know the three crawler jobs

Treat the job, not the brand, as the unit of control.

  • Training: ingests pages to train or fine-tune a model. Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot.
  • Search indexing: builds the index an AI answer engine cites. Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot.
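  • User fetch: retrieves a single page because a user asked a chatbot about it. Examples: ChatGPT-User, Claude-User, Perplexity-User. These are not bulk crawlers, and operators differ on whether a user-initiated fetch honors robots.txt, so confirm the behavior in each vendor's documentation.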
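To make the job-first view concrete, here is a minimal Python sketch that labels a request's User-Agent header with one of the three jobs. The `JOB_BY_TOKEN` table and the `classify_user_agent` function are hypothetical names introduced for illustration, and the table covers only tokens named on this card that are assumed to appear in request headers. Control-only tokens such as Google-Extended are left out because, as far as I know, they are matched in robots.txt rather than sent with a request.

```python
# Hypothetical lookup table built from the examples on this card.
# It is not an exhaustive or authoritative registry of crawler tokens.
JOB_BY_TOKEN = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search indexing",
    "Claude-SearchBot": "search indexing",
    "PerplexityBot": "search indexing",
    "ChatGPT-User": "user fetch",
    "Claude-User": "user fetch",
    "Perplexity-User": "user fetch",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the crawler job for a User-Agent header, or "unknown"."""
    lowered = user_agent.lower()
    for token, job in JOB_BY_TOKEN.items():
        # Case-insensitive substring match, since the token is usually
        # embedded in a longer browser-style header.
        if token.lower() in lowered:
            return job
    return "unknown"


if __name__ == "__main__":
    # The header below is an illustrative example, not a quoted vendor string.
    print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify_user_agent("curl/8.0"))
```

A header match only tells you what the client claims to be; it is a starting point for log analysis, not proof of identity.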

Reference table

User-agent          Operator      Job                     Block in robots.txt
GPTBot              OpenAI        Training                User-agent: GPTBot
OAI-SearchBot       OpenAI        Search index            User-agent: OAI-SearchBot
ChatGPT-User        OpenAI        User fetch              User-agent: ChatGPT-User
ClaudeBot           Anthropic     Training                User-agent: ClaudeBot
Claude-SearchBot    Anthropic     Search index            User-agent: Claude-SearchBot
Claude-User         Anthropic     User fetch              User-agent: Claude-User
Google-Extended     Google        Training (Gemini)       User-agent: Google-Extended
Googlebot           Google        Search index            User-agent: Googlebot
PerplexityBot       Perplexity    Search index            User-agent: PerplexityBot
Perplexity-User     Perplexity    User fetch              User-agent: Perplexity-User
Applebot-Extended   Apple         Training                User-agent: Applebot-Extended
Bytespider          ByteDance     Training                User-agent: Bytespider
CCBot               Common Crawl  Training corpus         User-agent: CCBot
Amazonbot           Amazon        Search and assistant    User-agent: Amazonbot
Meta-ExternalAgent  Meta          Training                User-agent: Meta-ExternalAgent
cohere-ai           Cohere        Training and inference  User-agent: cohere-ai
MistralAI-User      Mistral       User fetch              User-agent: MistralAI-User
Diffbot             Diffbot       Knowledge graph         User-agent: Diffbot
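One way to turn the table into a policy is to generate the robots.txt groups from a per-job decision instead of writing each group by hand. The sketch below assumes a hypothetical `AGENTS` list (a subset of the table, with jobs shortened to three labels) and a hypothetical `build_robots_txt` helper; neither is part of any tool referenced on this card.

```python
# Hypothetical subset of the reference table: (robots.txt token, job).
AGENTS = [
    ("GPTBot", "training"),
    ("OAI-SearchBot", "search"),
    ("ChatGPT-User", "user"),
    ("ClaudeBot", "training"),
    ("Claude-SearchBot", "search"),
    ("Claude-User", "user"),
]


def build_robots_txt(policy: dict[str, bool]) -> str:
    """Emit one User-agent group per agent.

    `policy` maps a job label to True (allow) or False (block).
    Jobs missing from `policy` default to allow in this sketch.
    """
    groups = []
    for token, job in AGENTS:
        rule = "Allow: /" if policy.get(job, True) else "Disallow: /"
        # Every group pairs the token with an explicit directive.
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Block training crawlers, keep search-index and user-fetch access.
    print(build_robots_txt({"training": False}))
```

Generating the file this way keeps the decision at the job level, so adding a new token to the list automatically inherits the policy for its job.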

Allow or block with full rules, not a bare token

Decide the policy per agent, then state the directive. A User-agent line with no Disallow or Allow rule blocks nothing by itself and is easy to misread; always pair each token with an explicit rule.

# Option A: allow everything (this site's posture)
User-agent: GPTBot
Allow: /

# Option B: block training but keep search-index access
# (use instead of Option A, not alongside it)
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

Blocking GPTBot keeps your pages out of future training crawls; it does not remove anything already collected, and it leaves ChatGPT-User and OAI-SearchBot free to fetch and cite you. To stay out of AI answers entirely, block the search-index and user-fetch agents too.
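A quick way to check what a rule set actually permits is to parse it with Python's standard-library `urllib.robotparser`. The sketch below feeds the block-training policy to the parser and asks about three OpenAI tokens. The `is_allowed` helper and the example.com URL are placeholders for illustration, and a real crawler's matching logic may differ from this parser's.

```python
from urllib.robotparser import RobotFileParser

# The "block training, keep search-index access" policy from this card.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def is_allowed(robots_txt: str, agent: str,
               url: str = "https://example.com/page") -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User"):
        print(agent, is_allowed(ROBOTS_TXT, agent))
```

Under this parser GPTBot is refused while OAI-SearchBot and ChatGPT-User are permitted, the latter because no group names it and the file has no wildcard group.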

Watch for fleet splits and stale tokens

Vendors rename and split agents, so rules written against last year's tokens leak: a new token that no group names falls through to your wildcard or default. Anthropic retired Claude-Web in favor of Claude-User and Claude-SearchBot; Google separated Google-Extended (training) from Googlebot (search). Re-check operator docs each quarter, and prefer an explicit allow for each agent you want over a single wildcard. This site allows every agent because each page is built to be cited; see for-ai-agents and the curated index at llms-txt. For crawl-volume tradeoffs on large sites, see crawl-budget and ai-search-optimization.
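The quarterly re-check can be partly automated. The following sketch scans a robots.txt body for retired tokens and for expected tokens that no group names. The `RETIRED` map contains only the Claude-Web example from this card, and `audit_tokens` is a hypothetical helper; maintain your own lists from the operators' documentation.

```python
import re

# Retired token mapped to its replacements, taken from this card's example.
RETIRED = {"Claude-Web": ["Claude-User", "Claude-SearchBot"]}

# Matches "User-agent: <token>" lines, ignoring case and trailing comments.
USER_AGENT_LINE = re.compile(r"(?im)^\s*user-agent\s*:\s*([^#\n]+)")


def audit_tokens(robots_txt: str, expected: list[str]) -> dict[str, list[str]]:
    """Report retired tokens still present and expected tokens not declared."""
    declared = {
        match.group(1).strip().lower()
        for match in USER_AGENT_LINE.finditer(robots_txt)
    }
    stale = sorted(t for t in RETIRED if t.lower() in declared)
    missing = sorted(t for t in expected if t.lower() not in declared)
    return {"stale": stale, "missing": missing}


if __name__ == "__main__":
    sample = "User-agent: Claude-Web\nDisallow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
    print(audit_tokens(sample, ["GPTBot", "Claude-User", "Claude-SearchBot"]))
```

A stale entry is harmless by itself, but it usually signals that the replacement tokens were never added, which is what the missing list surfaces.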