Technical Standards

robots.txt for AI Crawlers

robots.txt for AI crawlers is the use of the standard /robots.txt file to allow or block specific AI user agents such as GPTBot, ClaudeBot, PerplexityBot and Google-Extended, controlling which AI systems can access the site's content for training or real-time grounding.

Also known as: AI robots.txt, robots.txt AI rules, AI crawler access policy

robots.txt is a long-standing web standard (the Robots Exclusion Protocol, formalized in RFC 9309) that lets a site tell crawlers which paths they may or may not fetch. AI companies have adopted it for their own crawlers and publish the user agent names they use. A site that wants to manage AI access lists those user agents in robots.txt with explicit Allow or Disallow rules.
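To make the Allow and Disallow mechanics concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the SomeOtherBot name are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is kept out of /private/ but may fetch
# everything else; all other crawlers are allowed everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
Allow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_private = can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/private/report.html")
gptbot_blog = can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post.html")
other_private = can_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/report.html")

print(gptbot_private, gptbot_blog, other_private)
```

Python's parser applies the first matching rule in file order, so the more specific Disallow line is listed before the broad Allow line. A real crawler may resolve conflicts differently (RFC 9309 describes longest-match precedence), so this is only a rough model of how a given crawler reads the file.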

Two categories of decision matter. The first is whether to allow training crawlers, which fetch content to be included in future model training. Blocking them protects content from being absorbed into training data but does not affect whether current AI products can answer questions about the brand. The second is whether to allow real-time crawlers, which fetch pages on demand to ground specific answers. Blocking real-time crawlers usually removes the site from the AI product's answer surface entirely.
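The two decisions can be expressed as a small policy that blocks one category and allows the other. The sketch below is illustrative only: the CRAWLER_ROLES mapping is an assumption made for the example, not an authoritative classification, and the build_robots_txt helper and the /pricing path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed grouping, for illustration only. Operators can change what each
# agent is used for, so check their published documentation before relying
# on any of these labels.
CRAWLER_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "realtime",
    "PerplexityBot": "realtime",
}


def build_robots_txt(roles: dict[str, str], blocked_roles: set[str]) -> str:
    """Emit one group per crawler: Disallow: / if its role is blocked, else Allow: /."""
    groups = []
    for agent, role in roles.items():
        rule = "Disallow: /" if role in blocked_roles else "Allow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    # Groups are separated by a blank line.
    return "\n".join(groups)


# Block training crawlers, keep real-time crawlers allowed.
robots_txt = build_robots_txt(CRAWLER_ROLES, blocked_roles={"training"})

# Check how the generated file resolves for each listed agent.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
access = {agent: parser.can_fetch(agent, "/pricing") for agent in CRAWLER_ROLES}

print(robots_txt)
print(access)
```

Keeping the category labels in one place makes the trade-off explicit: changing blocked_roles switches the whole policy without editing individual groups.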

Common user agents to consider include GPTBot, OAI-SearchBot, ClaudeBot, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended (a token that controls Google's AI use specifically), CCBot (Common Crawl, used by many downstream models), Applebot-Extended and Bytespider. The list changes over time, so a robots.txt file needs occasional review to stay effective.
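A periodic review can be partly automated by checking which known AI user agents a robots.txt file never names. The sketch below assumes the list of agents given above. The agents_without_rules helper and the sample file are hypothetical, and the check only looks for explicit User-agent lines; it does not evaluate wildcard groups.

```python
# User agents taken from the list above; it needs updating as names change.
KNOWN_AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "Perplexity-User", "Google-Extended", "CCBot", "Applebot-Extended",
    "Bytespider",
]


def agents_without_rules(robots_txt: str, known_agents: list[str]) -> list[str]:
    """Return the known agents that no User-agent line names explicitly."""
    named = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return [agent for agent in known_agents if agent.lower() not in named]


# Hypothetical robots.txt that only addresses two crawlers.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

missing = agents_without_rules(SAMPLE, KNOWN_AI_AGENTS)
print(missing)
```

Agents reported as missing fall back to whatever the wildcard group (User-agent: *) says, or to full access if there is none, which may or may not be the intended policy.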

Key points

robots.txt controls which AI crawlers may access which paths on a site.
In practice, policies distinguish between training crawlers and real-time crawlers.
Blocking real-time crawlers usually removes the site from AI answer surfaces.
Crawler user-agent names change, so the file needs periodic review.

Frequently asked questions

How do I block AI crawlers in robots.txt?

Add a User-agent line naming the crawler (for example, User-agent: GPTBot) followed by a Disallow rule (Disallow: /). Repeat for each crawler you want to block. Common targets include GPTBot, ClaudeBot, PerplexityBot, Google-Extended and CCBot.

Will blocking AI crawlers remove my site from ChatGPT or Perplexity?

Blocking real-time crawlers usually removes the site from that product's answer surface, because the product can no longer fetch your pages on demand. Blocking training crawlers does not affect current answers but prevents the content from contributing to future model versions.

What is Google-Extended?

Google-Extended is a robots.txt token used to control whether Google may use the site's content for its Gemini AI models (formerly Bard) separately from its main search index. Disallowing Google-Extended opts out of that AI use while keeping classic search indexing. It does not govern AI Overviews, which are a Google Search feature and follow the Googlebot rules.
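As a rough illustration of how the opt-out resolves, the sketch below disallows Google-Extended and checks that the Googlebot token is unaffected. Google-Extended is a control token rather than a crawler with its own HTTP user agent string, so Python's urllib.robotparser is used here only to model how the rules apply per token. The robots.txt content is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google's AI use, say nothing else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Google-Extended token is disallowed for the whole site.
ai_use_allowed = parser.can_fetch("Google-Extended", "/")

# Googlebot has no matching group and no wildcard group, so it stays allowed.
search_allowed = parser.can_fetch("Googlebot", "/")

print(ai_use_allowed, search_allowed)
```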