AI Engines and Surfaces

AI Crawler

An AI crawler is an automated user agent operated by an AI company that fetches public web pages, either to train large language models or to ground AI answers in real time. Named examples include GPTBot, ClaudeBot, PerplexityBot, Google-Extended and CCBot.

Also known as: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, AI bot, LLM crawler

AI crawlers fall into two practical groups. Training crawlers gather content that is folded into the next generation of a model. Real-time crawlers fetch pages on demand to ground a specific answer that a user just asked for. Some operators run both kinds, so a site visited by both may want to allow one and block the other.

Each crawler announces itself with a specific user agent string, and a robots.txt file tells crawlers which paths they may access. Compliance is voluntary, so the rules only bind crawlers that choose to honor them. Blocking a training crawler means future model versions will not learn from the site, but the live AI product can still cite the site if its real-time crawler is allowed. Blocking a real-time crawler means the AI product cannot include the site in answers, which usually has a larger visibility cost.
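The allow-one, block-the-other policy described above can be checked before it is deployed. The following is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs and the helper name build_parser are illustrative assumptions, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training crawler site-wide,
# but allow its search and citation crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask how each crawler token would be treated for one example page.
for agent in ("GPTBot", "OAI-SearchBot", "PerplexityBot"):
    allowed = parser.can_fetch(agent, "https://example.com/guide")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these rules GPTBot is refused and OAI-SearchBot is allowed. PerplexityBot matches no group, so the parser treats it as allowed by default.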

Common named crawlers include GPTBot (training for OpenAI), OAI-SearchBot (search and citation for ChatGPT), PerplexityBot and Perplexity-User (Perplexity), ClaudeBot and anthropic-ai (the AI assistant family from Anthropic) and CCBot (Common Crawl, used by many downstream models). Two further names, Google-Extended (controls use of content for Google's AI products) and Applebot-Extended (the equivalent control for Apple), are robots.txt tokens rather than crawlers with their own user agent. The list evolves quickly, so a site that wants to manage AI visibility should recheck it periodically.

Key points

AI crawlers fetch web pages for training, real-time answers, or both.
Each crawler identifies itself by a specific user agent string.
robots.txt is the standard way to allow or block AI crawlers per path.
Blocking a real-time crawler usually has more impact on visibility than blocking a training crawler.

Frequently asked questions

What is GPTBot?

GPTBot is OpenAI's training crawler. It fetches public web pages so they can be included in future model training data. It can be allowed or blocked in robots.txt.

Should I block AI crawlers?

Most sites that want to be visible in AI answers should allow real-time crawlers, even if they choose to block training crawlers. Blocking real-time crawlers usually removes the site from the AI product's answer surface.

How do I see which AI crawlers visit my site?

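Server logs, and some analytics tools, record the user agent string of each request. Searching them for GPTBot, ClaudeBot, PerplexityBot, CCBot and similar strings shows which AI crawlers have reached the site. Google-Extended will not appear there, because it is a robots.txt token and Google fetches pages under its regular crawler user agents.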
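As a rough illustration of that log check, the sketch below counts requests per AI crawler. It assumes an access log in the common "combined" format, where the user agent is the last double-quoted field. The function name count_ai_crawlers, the token list and the sample log lines are invented for the example. A user agent string can be spoofed, so the counts are indicative rather than proof of who made the request.

```python
import re
from collections import Counter

# Crawler tokens taken from the names in this entry; extend as the list evolves.
# Google-Extended is omitted: it is a robots.txt token, not a request user agent.
AI_CRAWLER_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "CCBot",
)

# Assumption: the user agent is the final double-quoted field on each line.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawlers(log_lines):
    """Return a Counter of requests per AI crawler token found in the log."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue  # line does not look like a combined-format entry
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break  # count each request once
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Jan/2025:10:00:03 +0000] "GET /faq HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2025:10:00:07 +0000] "GET /pricing HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [10/Jan/2025:10:00:09 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0"',
]

for token, hits in count_ai_crawlers(SAMPLE_LOG).most_common():
    print(token, hits)
```

In practice the same function can be fed an open log file, since iterating over a file object yields its lines one at a time.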