AI Crawler

An AI crawler is an automated user agent operated by an AI company that fetches public web pages to use either for training large language models or for real-time grounding inside AI answers, with named examples including GPTBot, ClaudeBot, PerplexityBot, Google-Extended and CCBot.

Also known as: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, AI bot, LLM crawler

AI crawlers fall into two practical groups. Training crawlers gather content that is folded into the next generation of a model. Real-time crawlers fetch pages on demand to ground a specific answer that a user just asked for. Some operators run both kinds. A single site can be visited by both, and its owner may choose to allow one kind while blocking the other.

Each crawler announces itself with a specific user agent string, and a robots.txt file tells each crawler which paths it may access. Compliance is voluntary, so robots.txt only restrains crawlers that choose to honor it. Blocking a training crawler keeps the site out of that operator's future training crawls, but the live AI product can still cite the site if its real-time crawler is allowed. Blocking a real-time crawler means the AI product cannot fetch the site to ground its answers, which usually has a larger visibility cost.
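
The allow-one, block-the-other policy described above can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a recommended policy: the robots.txt content, the `example.com` URLs and the `build_parser` helper are all hypothetical, and the result only shows how a compliant crawler would interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block OpenAI's training crawler, allow its
# search crawler, and keep /private/ off limits to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Report what each user agent may fetch under the policy above.
for agent in ("GPTBot", "OAI-SearchBot", "ClaudeBot"):
    for url in ("https://example.com/blog/post",
                "https://example.com/private/report"):
        verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
        print(f"{agent:14} {url:38} {verdict}")
```

ClaudeBot has no group of its own in this sample, so it falls through to the `*` rules. A crawler with its own group, such as OAI-SearchBot here, follows only that group, which is why a dedicated group should repeat any shared restrictions it is meant to keep.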

Common named crawlers include GPTBot (training for OpenAI), OAI-SearchBot (search and citation for ChatGPT), PerplexityBot and Perplexity-User (Perplexity), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (controls use of content for Google's AI products), CCBot (Common Crawl, used by many downstream models) and Applebot-Extended (Apple). The list evolves quickly, so a site that wants to manage its AI visibility should review it periodically.

Key points

  • AI crawlers fetch web pages for training, real-time answers, or both.
  • Each crawler identifies itself by a specific user agent string.
  • robots.txt is the standard way to allow or block AI crawlers per path.
  • Blocking a real-time crawler usually has more impact on visibility than blocking a training crawler.

Frequently asked questions

What is GPTBot?

GPTBot is OpenAI's training crawler. It fetches public web pages so they can be included in future model training data. It can be allowed or blocked in robots.txt.

Should I block AI crawlers?

Most sites that want to be visible in AI answers should allow real-time crawlers, even if they choose to block training crawlers. Blocking real-time crawlers usually removes the site from the AI product's answers.

How do I see which AI crawlers visit my site?

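Server logs and some analytics tools record user agent strings. Searching them for GPTBot, ClaudeBot, PerplexityBot, CCBot and similar strings shows which AI crawlers have reached the site. Google-Extended is an exception: it is a robots.txt control token rather than a separate request user agent, so it does not show up in logs.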
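
A log check of this kind can be sketched in a few lines of Python. The sketch assumes access logs in the common "combined" format, where the user agent is the last double-quoted field. The sample lines, the token list and the `count_ai_crawlers` helper are illustrative assumptions, and the user agent strings are simplified stand-ins rather than the exact strings each operator sends.

```python
import re
from collections import Counter

# Tokens from this entry that appear in HTTP User-Agent headers.
# Illustrative and quick to go stale. Google-Extended is omitted
# because it is a robots.txt token, not a request user agent.
AI_CRAWLER_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Perplexity-User", "CCBot",
)

# In the combined log format, the user agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawlers(log_lines):
    """Count requests per AI crawler token found in the user agent."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up sample lines with simplified user agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [10/May/2025:10:00:05 +0000] "GET /blog HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [10/May/2025:10:01:00 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

counts = count_ai_crawlers(SAMPLE_LOG)
for token, hits in counts.most_common():
    print(token, hits)
```

User agent strings are self-reported and can be spoofed, so counts like these are a first indication rather than proof of which operator made a request.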

Related terms

robots.txt for AI Crawlers
robots.txt for AI crawlers is the use of the standard /robots.txt file to allow or block specific AI user agents such as GPTBot, ClaudeBot, PerplexityBot and Google-Extended, controlling which AI systems can access the site's content for training or real-time grounding.
llms.txt
llms.txt is a proposed plain-text file placed at the root of a website that gives large language models a concise, curated map of the site's most important pages and content sections, so AI systems can find the right pages without having to crawl the entire site.
Large Language Model (LLM)
A large language model (LLM) is a machine learning model trained on huge amounts of text to predict the next token in a sequence, which lets it generate fluent natural-language responses and power products such as ChatGPT, Perplexity, Gemini and Copilot.
Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is an AI architecture that first retrieves relevant documents from an external source and then feeds them to a language model so the model can ground its answer in those documents rather than relying only on what it memorized during training.