What is an AI crawler?

How AI crawlers differ from traditional search engine bots, what data they collect, and why they matter for your business.

An AI crawler is an automated program that visits websites to collect data for training or operating artificial intelligence models. Unlike traditional search engine crawlers (like Googlebot) that index pages to serve search results, AI crawlers harvest content to build large language models, image generators, and other AI systems.

The request arrives over HTTP like any other. What separates an AI crawler from a reader is what happens to the content after it is fetched. A Googlebot crawl indexes a page so a human can find it. A GPTBot crawl feeds a model that will answer the human instead.

Why AI crawlers matter right now

The content AI crawlers take has real economic value. If your articles, product descriptions, or proprietary research train a model, you receive no compensation, no attribution, and no traffic. The resulting model competes with you by answering the same questions your content addresses.

Cloudflare Radar reports 39% of the top one million websites are accessed by AI bots as of early 2026, while only 2.98% actively block them. Anthropic’s crawl-to-referral ratio through 2025 was roughly 500,000 to 1 — half a million pages fetched for every visitor sent back.

AI agents are also the new intermediary between your content and the reader. An article read inside ChatGPT, summarized by Perplexity, or cited in an AI Overview produces no ad impression and no direct reader relationship. The traffic is real. The monetization path is not.

Types of AI crawlers

Four categories show up in server logs, each extracting different data for different reasons.

**Training crawlers.** GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended, Bytespider, Applebot-Extended, and CCBot sweep broadly to build training corpora. They pull text, code, structured data, images, and comments. Their requests are the clearest candidates for licensing.

**Retrieval and grounding crawlers.** PerplexityBot, OAI-SearchBot, and ChatGPT-User fetch pages at query time to ground a model’s answer. They are closer to search indexers than training crawlers, but they do not send referral traffic the way Googlebot does.

**Agentic traffic.** Generated by AI agents acting for a specific human user — a ChatGPT agent checking flight prices, a Claude agent researching a paper. Typically a headless browser on cloud infrastructure, often routed through residential proxies.

**Unlabeled and spoofed crawlers.** The largest and messiest category. cohere-ai, Meta-ExternalAgent, commercial services (BrightData, Oxylabs, ScraperAPI), and smaller operators rotating through residential IPs. Some target paywalled content to bypass access controls.
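As a rough illustration of the labeled categories above, the sketch below maps a User-Agent header to a category by substring match. The function name `categorize` and the `"unclassified"` fallback are assumptions made for this example. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens rather than strings sent in request headers. Agentic and spoofed traffic cannot be recognized this way, which is why the fallback exists.

```python
# Hypothetical lookup table: lowercase token -> category from the list above.
# Only tokens that actually appear in User-Agent headers are included.
CATEGORY_BY_TOKEN = {
    "gptbot": "training",
    "claudebot": "training",
    "bytespider": "training",
    "ccbot": "training",
    "perplexitybot": "retrieval",
    "oai-searchbot": "retrieval",
    "chatgpt-user": "retrieval",
}


def categorize(user_agent: str) -> str:
    """Return the category for a declared crawler, or "unclassified".

    A match only means the client *claims* to be that crawler; the header
    is trivially spoofable, so treat the result as a first-pass label.
    """
    ua = user_agent.lower()
    for token, category in CATEGORY_BY_TOKEN.items():
        if token in ua:
            return category
    return "unclassified"
```

Anything that lands in `"unclassified"` is either a human browser, an agent, or a crawler that does not announce itself, and needs the deeper signals discussed later in the article.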

How AI crawlers work

Mechanically, AI crawlers are HTTP clients. Each request carries a user agent and other headers, and arrives over a connection with its own TLS handshake and HTTP/2 settings. The software stack and the intent separate a crawler from a browser.

Training crawlers are the simplest. A scheduler runs, a fetcher opens an HTTP connection, a parser extracts text and links, results go into a dataset. GPTBot and ClaudeBot publish IP ranges and respect robots.txt in most cases, with a predictable footprint: consistent user agent, consistent TLS fingerprint, steady cadence.
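The scheduler, fetcher, parser, and dataset steps can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration, not how GPTBot or ClaudeBot are actually built: the names `crawl` and `PageParser`, the `ExampleBot` user agent, and the injected `fetch` callable (which stands in for a real HTTP client) are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects text nodes and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(seed, fetch, robots_txt, user_agent="ExampleBot", limit=10):
    """Breadth-first crawl: schedule, check robots.txt, fetch, parse, store.

    `fetch` is a hypothetical callable taking a URL and returning HTML
    (or None on failure); `robots_txt` is the site's robots.txt as text.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    queue, seen, dataset = [seed], {seed}, {}
    while queue and len(dataset) < limit:
        url = queue.pop(0)
        if not robots.can_fetch(user_agent, url):
            continue  # a compliant crawler skips disallowed paths
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)
        dataset[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return dataset
```

Note that the robots.txt check is a line the crawler's author chose to write. Nothing on the server side forces it to be there.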

Retrieval crawlers are stateful and bursty, driven by query volume rather than a schedule. Agentic traffic is the hardest to characterize: patched Chromium, headless browsers, or direct HTTP clients depending on the task, often through residential proxies.

How to identify AI crawlers on your site

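User agents are the starting point, not the answer. Major operators publish their user-agent tokens (GPTBot, ClaudeBot, PerplexityBot, CCBot), and matching those in your logs identifies the compliant traffic, which is the traffic least likely to cause a problem. Google-Extended and Applebot-Extended are robots.txt control tokens rather than request user agents, so they do not appear in logs; the fetches themselves arrive as Googlebot and Applebot.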
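A first pass over an access log might look like the sketch below. It assumes the common "combined" log format, where the User-Agent is the last quoted field on each line; the function name `count_declared_crawlers` and the token list are illustrative choices, not a complete inventory.

```python
import re
from collections import Counter

# In combined log format the last quoted field on the line is the User-Agent.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

# Illustrative subset of published tokens that appear in request headers.
KNOWN_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def count_declared_crawlers(log_lines):
    """Count requests per self-declared crawler token."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1).lower()
        for token in KNOWN_TOKENS:
            if token.lower() in ua:
                counts[token] += 1
                break
    return counts
```

The counts cover only clients that announce themselves. A request with a browser-like user agent contributes nothing here, whatever it really is.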

For everything else, you need request-level signals the client does not fully control. TLS fingerprinting (JA4) exposes the library behind the handshake. HTTP/2 fingerprinting distinguishes browsers from libraries by their SETTINGS frame values, initial WINDOW_UPDATE increment, and pseudo-header order. Behavioral patterns, such as request cadence and whether page assets are loaded, separate readers from crawlers. Cross-layer consistency is the decisive check: a request claiming Chrome via user agent, carrying a curl-impersonate TLS fingerprint, with Go-library HTTP/2 settings is a crawler that lied twice. Centinel maintains 1,600+ crawler fingerprints combining these layers.
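The cross-layer check can be expressed as a comparison of three labels. The sketch below assumes an upstream step has already mapped the raw JA4 string and the HTTP/2 fingerprint to a client family; that lookup is not shown, and the family names, `RequestSignals`, and `inconsistencies` are hypothetical and do not describe Centinel's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    ua_family: str   # family claimed by the User-Agent header, e.g. "chrome"
    tls_family: str  # family inferred from the TLS fingerprint (assumed lookup)
    h2_family: str   # family inferred from HTTP/2 settings and header order


def inconsistencies(signals: RequestSignals) -> list:
    """Return the layers whose inferred family contradicts the User-Agent."""
    mismatched = []
    if signals.tls_family != signals.ua_family:
        mismatched.append("tls")
    if signals.h2_family != signals.ua_family:
        mismatched.append("http2")
    return mismatched


# The example from the text: claims Chrome, but neither lower layer agrees.
suspect = RequestSignals("chrome", "curl-impersonate", "go")
genuine = RequestSignals("chrome", "chrome", "chrome")
```

Here `inconsistencies(suspect)` reports both the TLS and HTTP/2 layers, while `inconsistencies(genuine)` returns an empty list. A production system would likely score partial matches rather than require exact equality, since browser versions shift their fingerprints over time.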

How to respond to AI crawler traffic

Once the traffic is identified, three responses are available. Pick per agent, not per traffic source.

**Block.** For training crawlers you have not licensed. For scrapers that ignore robots.txt. For spoofed traffic that fails consistency checks. Block at the edge so the origin never sees the request.

**Verify and allow.** For search indexers you want to appear in. For partner agents. For AI-on-behalf-of-user traffic you want through but want to audit. Pass the request with a signed trust stamp and monitor cumulative volume per operator.

**Watchlist.** For training crawlers you have not decided about yet. Centinel records every visit per agent and gives you the audit trail to take action later — block, challenge, or escalate when policy is set.

robots.txt alone will not enforce any of these, because compliance is voluntary. 32% of AI scrapes bypass it. Enforcement lives at the edge.
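One way to picture the per-agent decision is as a small function evaluated for each identified agent. The sketch below is an assumption-laden illustration: the category names, the `licensed` flag (with `None` meaning "no decision yet"), allowing licensed training crawlers through, and defaulting unknown categories to the watchlist are choices made for this example rather than rules stated above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    BLOCK = "block"
    VERIFY_AND_ALLOW = "verify_and_allow"
    WATCHLIST = "watchlist"


def decide(
    category: str,
    *,
    consistent: bool = True,
    obeys_robots: bool = True,
    licensed: Optional[bool] = None,
) -> Action:
    """Pick a response for one agent.

    category: hypothetical label such as "training", "search",
              "partner", or "agentic".
    licensed: True/False once a licensing decision exists, None if undecided.
    """
    # Spoofed traffic and scrapers that ignore robots.txt are blocked outright.
    if not consistent or not obeys_robots:
        return Action.BLOCK
    # Search indexers, partners, and on-behalf-of-user agents pass with auditing.
    if category in {"search", "partner", "agentic"}:
        return Action.VERIFY_AND_ALLOW
    if category == "training":
        if licensed is None:
            return Action.WATCHLIST  # keep the audit trail, decide later
        # Assumption: a licensed training crawler is allowed through.
        return Action.VERIFY_AND_ALLOW if licensed else Action.BLOCK
    # Assumption: anything unrecognized is observed rather than blocked.
    return Action.WATCHLIST
```

Keeping the decision in one function per agent, rather than one switch for the whole site, is what makes it possible to block an unlicensed training crawler while still admitting the same operator's retrieval bot.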

Key takeaways

- AI crawlers are a separate category from search engine bots: they extract content to train or ground models that compete with your site rather than send traffic back.
- Four classes matter (training, retrieval, agentic, and unlabeled), and each calls for a different policy.
- User agents alone cannot identify crawlers reliably; TLS fingerprinting, HTTP/2 signals, and cross-layer consistency checks catch the 32% that bypass robots.txt.
- Your response is block, verify, or watchlist: a per-agent decision enforced at the edge, not a single site-wide setting.
