robots.txt for AI bots: Complete guide

Configure robots.txt for AI crawlers. Every directive, every major bot, and why robots.txt alone isn't enough.

What is robots.txt

robots.txt is a plain text file at the root of your site (yoursite.com/robots.txt). It tells automated clients which paths they may visit. It is the web's oldest way to talk to crawlers. Engineers designed it in 1994 for search engines. Now it sits at the center of the AI crawler debate.

The file holds a set of directives. User-agent picks the crawler. Disallow lists blocked paths. Allow carves out exceptions. A compliant crawler fetches /robots.txt first. Then it honors what it finds.
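To make the directives concrete, here is a minimal sketch that parses a small file with Python's standard urllib.robotparser module and asks what each crawler may fetch. The file content and paths are hypothetical. This parser applies rules in file order, so the Allow exception is listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /private/press/
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# GPTBot is blocked from /private/ except the /private/press/ carve-out.
print(parser.can_fetch("GPTBot", "/private/report.html"))
print(parser.can_fetch("GPTBot", "/private/press/launch.html"))

# Any other crawler falls back to the User-agent: * group, which blocks nothing.
print(parser.can_fetch("Googlebot", "/private/report.html"))
```

This only shows what a compliant client would conclude from the file. It says nothing about whether a given crawler actually behaves that way.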

Why robots.txt matters right now

robots.txt is the public statement of your crawling policy. In 2026 it is also the most ignored file on the web.

Our data shows that about 30% of AI bot scrapes ignore robots.txt. We measured ChatGPT-User fetching pages on 42% of the sites that blocked it. In our data, AI bots reach 39% of the top one million websites. Only 2.98% of those sites block the bots in robots.txt.

This gap makes the file both important and weak. A compliant crawler reads it first. A lawyer quotes it first. But the file does not enforce anything.

Types of robots.txt directives for AI crawlers

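You address AI crawlers by user agent token. The main tokens in 2026 are GPTBot, ChatGPT-User, and OAI-SearchBot (OpenAI), ClaudeBot (Anthropic), Google-Extended, Bytespider, Applebot-Extended, CCBot, PerplexityBot, Amazonbot, Meta-ExternalAgent, and cohere-ai. Google-Extended and Applebot-Extended are control tokens, not separate crawlers. They work in robots.txt rules but do not appear as user agents in request logs.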

Three configurations cover most policies.

Block all AI crawlers, allow search engines. Add a Disallow for each AI user agent. Keep Googlebot and Bingbot allowed. This is the default for publishers who want search visibility but no AI training.

Allow AI crawlers, restrict to sections. Set Allow and Disallow paths per user agent. For example, allow /public/ and block /archive/ for GPTBot. This helps when you want AI search indexing but keep training crawlers away from premium content.

Blanket allow. Use a single User-agent: * with an empty Disallow. Many sites default to this. That is why about 97% of the top million are open to AI bots today.
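The three configurations above can be sketched as small generator functions. This is an illustration under assumptions: the helper names (render_group, block_all_ai, restrict_sections, blanket_allow, allowed) are hypothetical, and the agent list simply repeats the user agents named earlier. The allowed helper checks the result with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# User agents named in this guide; extend the tuple as new crawlers appear.
AI_USER_AGENTS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Google-Extended", "Bytespider", "Applebot-Extended", "CCBot",
    "PerplexityBot", "Amazonbot", "Meta-ExternalAgent", "cohere-ai",
)


def render_group(agent, allow=(), disallow=()):
    """Render one robots.txt group for a single user agent."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    if not allow and not disallow:
        lines.append("Disallow:")  # an empty Disallow blocks nothing
    return "\n".join(lines) + "\n"


def block_all_ai(agents=AI_USER_AGENTS):
    """Block every listed AI agent; everyone else (search engines) stays allowed."""
    groups = [render_group(agent, disallow=["/"]) for agent in agents]
    groups.append(render_group("*"))
    return "\n".join(groups)


def restrict_sections(agent, allow, disallow):
    """Give one agent per-section rules; everyone else stays allowed."""
    return "\n".join([render_group(agent, allow, disallow), render_group("*")])


def blanket_allow():
    """A single User-agent: * group with an empty Disallow."""
    return render_group("*")


def allowed(robots_txt, agent, path):
    """Ask the standard-library parser what a compliant crawler may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(restrict_sections("GPTBot", allow=["/public/"], disallow=["/archive/"]))
```

Generating the file from one list keeps the per-agent groups in sync when a new crawler needs to be added.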

robots.txt cannot tell honest identification from a spoofed user agent. It cannot set different rules for different uses by the same crawler. The directives are coarse.

How robots.txt works

Crawlers should request /robots.txt before they crawl. They parse the directives. Then they follow the rules for their user agent. The protocol runs on honor. A crawler that ignores the file hits no technical barrier.

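A crawler uses the group that names its user agent. If no group names it, it falls back to User-agent: *. Within the group, the rule with the longest matching path wins, and Allow wins a tie. Many crawlers cache the file for up to 24 hours. So a policy change takes time to spread.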
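A minimal sketch of that matching order, assuming plain path prefixes with no * or $ wildcards and exact, case-insensitive user agent tokens: pick the group that names the crawler, fall back to User-agent: *, then let the longest matching path decide, with Allow winning a tie. The function names and sample rules are hypothetical, and real parsers handle more edge cases.

```python
def parse_groups(robots_txt):
    """Return {lowercased user-agent token: [(is_allow, path), ...]}."""
    groups, current_agents, seen_rule = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current_agents:
            seen_rule = True
            if value:  # an empty path matches nothing
                for agent in current_agents:
                    groups[agent].append((field == "allow", value))
    return groups


def is_allowed(robots_txt, user_agent, path):
    """Longest matching path wins; Allow wins a tie; no match means allowed."""
    groups = parse_groups(robots_txt)
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_for_allow = len(prefix) == best_len and is_allow
            if longer or tie_for_allow:
                best_len, allowed = len(prefix), is_allow
    return allowed


# Hypothetical rules: the Disallow comes first, but the longer Allow still wins.
RULES = """\
User-agent: GPTBot
Disallow: /docs/
Allow: /docs/public/

User-agent: *
Disallow: /tmp/
"""

print(is_allowed(RULES, "GPTBot", "/docs/public/intro.html"))
print(is_allowed(RULES, "GPTBot", "/docs/internal.html"))
print(is_allowed(RULES, "ClaudeBot", "/tmp/cache.bin"))
```

A crawler with its own group does not inherit rules from User-agent: *. In this sample GPTBot may fetch /tmp/ because only its own group applies.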

The file is also public. Anyone can read yoursite.com/robots.txt, including the crawlers you want to block. That openness helps a compliant ecosystem. It hurts when an operator reads the file as a map of what to take.

How to detect when robots.txt is ignored

Three checks close the gap between what your robots.txt says and what happens.

Log sampling against declared blocks. Pull the list of user agents you disallowed. Search access logs for hits from those agents. Any match is one of two things. It is an honest crawler that missed the update. Or it is a dishonest one that read the file and kept going. Our 42% ChatGPT-User figure came from this kind of check.
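A log sampling check of this kind can be sketched as follows. This is not the pipeline behind the figures above. It assumes access logs in the common combined format, and the sample lines, IPs, and user agent strings are simplified placeholders. Requests for /robots.txt itself are skipped, since reading the policy is what a compliant crawler should do.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "ua"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) .*?"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)


def hits_from_blocked_agents(log_lines, blocked_agents):
    """Count page requests whose user agent contains a disallowed token."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        parts = match.group("request").split()
        path = parts[1] if len(parts) >= 2 else ""
        if path.split("?", 1)[0] == "/robots.txt":
            continue  # fetching the policy file is not a violation
        user_agent = match.group("ua").lower()
        for agent in blocked_agents:
            if agent.lower() in user_agent:
                counts[agent] += 1
    return counts


# Hypothetical log lines for illustration.
SAMPLE_LOG = [
    '198.51.100.7 - - [10/Jan/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [10/Jan/2026:10:00:02 +0000] "GET /archive/a.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [10/Jan/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

print(hits_from_blocked_agents(SAMPLE_LOG, ["GPTBot", "ClaudeBot"]))
```

A non-zero count is a lead, not a verdict. The request may come from a crawler with a stale cache or from a client spoofing the user agent.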

User agent honesty. A GPTBot request should come from an IP in OpenAI's published range. A Googlebot request should resolve by reverse DNS to a Google host, and a forward lookup of that host should return the same IP. A declared user agent from an IP outside the operator's range is almost certainly a spoof.
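Both honesty checks can be sketched with the Python standard library. The CIDR range below is a documentation placeholder, not a real operator range, so load the operator's published list in practice. The function names are hypothetical. The DNS helper takes its lookup functions as parameters so it can be exercised without network access, and its default forward lookup handles IPv4 only.

```python
import ipaddress
import socket


def ip_in_ranges(ip, cidr_ranges):
    """True if ip falls inside any of the published CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def verify_by_reverse_dns(ip, allowed_suffixes,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the host suffix, then confirm with a forward lookup."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False  # host is not under the operator's domain
    try:
        return ip in forward_lookup(hostname)[2]
    except OSError:
        return False


# Placeholder range from the documentation address space (assumption).
HYPOTHETICAL_OPERATOR_RANGES = ["192.0.2.0/24"]

print(ip_in_ranges("192.0.2.10", HYPOTHETICAL_OPERATOR_RANGES))
print(ip_in_ranges("203.0.113.5", HYPOTHETICAL_OPERATOR_RANGES))
```

For Googlebot, the call would look like verify_by_reverse_dns(client_ip, [".googlebot.com", ".google.com"]). The leading dot keeps a look-alike domain such as evilgooglebot.com from passing the suffix check.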

Blocked-UA traffic trend. Track the volume from disallowed user agents over time. If the number does not fall after you add the Disallow, the file is informing the crawler, not stopping it.
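One way to sketch the trend check is to compare average daily hits from a disallowed user agent before and after the Disallow went live. The function names, the seven-day window, and the one-day grace period (to let cached copies of the file expire) are assumptions for illustration, and the counts are made up.

```python
from datetime import date, timedelta


def mean_daily_hits(daily_hits, start, end):
    """Average hits per day over [start, end); missing days count as zero."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    total = sum(count for day, count in daily_hits.items() if start <= day < end)
    return total / days


def blocked_traffic_ratio(daily_hits, disallow_date, window_days=7, grace_days=1):
    """Ratio of post-block to pre-block daily volume; None if there was no prior traffic."""
    window = timedelta(days=window_days)
    before = mean_daily_hits(daily_hits, disallow_date - window, disallow_date)
    after_start = disallow_date + timedelta(days=grace_days)
    after = mean_daily_hits(daily_hits, after_start, after_start + window)
    if before == 0:
        return None
    return after / before


# Hypothetical counts: 100 hits a day before and after the Disallow was added.
disallow_date = date(2026, 1, 10)
flat_traffic = {disallow_date + timedelta(days=i): 100 for i in range(-7, 9)}
print(blocked_traffic_ratio(flat_traffic, disallow_date))
```

A ratio near 1.0 means the volume did not move after the Disallow. A ratio near 0 is what a compliant crawler would produce.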

How to enforce access when robots.txt fails

For real access control, you need a layer that identifies crawlers no matter what they claim. It must decide per request in real time.

That layer sits at the edge, before requests reach origin. It compares the TLS fingerprint with known library signatures. It checks the HTTP/2 SETTINGS frame for browser-vs-library markers. It correlates the user agent with the autonomous system of the client IP. It runs those checks against a database of known crawler signatures. Centinel tracks 1,600+ crawler fingerprints. A scraper using curl-impersonate to look like Chrome is caught on the TLS handshake, not on the body of the request.
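The correlation step can be sketched as a comparison between what a request claims and what the connection shows. This is a simplified illustration, not Centinel's implementation. The class names, the fingerprint strings, the SETTINGS values, and the AS numbers are all hypothetical placeholders, and the sketch assumes the edge has already extracted these signals from the connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    user_agent: str       # what the client claims to be
    tls_fingerprint: str  # hash of the TLS ClientHello, computed at the edge
    h2_settings: tuple    # HTTP/2 SETTINGS identifiers in the order sent
    asn: int              # autonomous system number of the client IP


@dataclass(frozen=True)
class ClientProfile:
    name: str
    ua_token: str
    tls_fingerprints: frozenset
    h2_settings: tuple
    asns: frozenset       # an empty set means "any network" (e.g. a browser)


def classify(signals, profiles):
    """Return 'verified:<name>', 'mismatch:<name>', or 'unknown'."""
    ua = signals.user_agent.lower()
    claimed = next((p for p in profiles if p.ua_token.lower() in ua), None)
    if claimed is None:
        return "unknown"
    checks = (
        signals.tls_fingerprint in claimed.tls_fingerprints,
        signals.h2_settings == claimed.h2_settings,
        not claimed.asns or signals.asn in claimed.asns,
    )
    return f"verified:{claimed.name}" if all(checks) else f"mismatch:{claimed.name}"


# Hypothetical signature database with a single entry.
PROFILES = [
    ClientProfile("ExampleBot", "examplebot", frozenset({"fp-examplebot"}),
                  (1, 3, 4), frozenset({64500})),
]

honest = RequestSignals("ExampleBot/1.0", "fp-examplebot", (1, 3, 4), 64500)
spoofed = RequestSignals("ExampleBot/1.0", "fp-http-library", (1, 3, 4), 64501)
print(classify(honest, PROFILES), classify(spoofed, PROFILES))
```

The user agent only selects which profile to test against. The verdict comes from signals the client cannot set with a header.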

Once identified, the crawler can be blocked, verified and allowed, or redirected to a paid licensing path. None of those options exist in robots.txt. All three are per-request decisions. robots.txt was never built to enforce. It is a courtesy notice. Enforcement is a separate job.
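The three outcomes can be sketched as a per-request decision. The names, the policy table, the status codes, and the licensing URL below are assumptions chosen for illustration. In particular, blocking verified crawlers that have no policy entry is one possible default, not a rule from any standard.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    ALLOW = "allow"
    LICENSE = "redirect_to_licensing"


def decide(crawler, verified, policy):
    """Pick an action for one request from the identified crawler and its policy."""
    if crawler is None:
        return Action.ALLOW  # not identified as a crawler; serve normally
    if not verified:
        return Action.BLOCK  # claims an identity it cannot back up
    return policy.get(crawler, Action.BLOCK)  # assumed default for unlisted crawlers


def to_response(action, licensing_url="https://example.com/licensing"):
    """Map an action to a (status code, headers) pair for the edge to return."""
    if action is Action.ALLOW:
        return 200, {}
    if action is Action.LICENSE:
        return 302, {"Location": licensing_url}
    return 403, {}


# Hypothetical per-crawler policy.
POLICY = {"ExampleSearchBot": Action.ALLOW, "ExampleTrainingBot": Action.LICENSE}

print(to_response(decide("ExampleTrainingBot", True, POLICY)))
print(to_response(decide("ExampleSearchBot", False, POLICY)))
```

Because the decision runs per request, the same crawler can be allowed on one path and sent to licensing on another by keying the policy on crawler and path together.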

Key takeaways

robots.txt is the public statement of your policy and the first file a compliant crawler reads. But about 30% of AI bot scrapes ignore it. And 42% of sites that block ChatGPT-User still see it fetch their pages.
Update the file often as new AI crawlers appear. GPTBot, ClaudeBot, Google-Extended, ChatGPT-User, Applebot-Extended, and Meta-ExternalAgent all matter in 2026.
Use robots.txt as a baseline, not your only defense. It cannot tell honest identification from a spoof. It cannot set per-use rules for the same crawler.
Enforcement lives at the edge. TLS fingerprinting, HTTP/2 checks, and a crawler signature database turn the courtesy notice into a real policy.