
How to block AI crawlers

A practical walkthrough of every blocking method, from robots.txt to edge-level detection, with the tradeoffs of each.

What is AI crawler blocking?

AI crawler blocking is the practice of denying access to automated clients that scrape content for AI training, retrieval, or agentic use. The methods range from a two-line entry in robots.txt to request-level detection at the edge, and they differ sharply in what they actually enforce. Two lines of robots.txt stop only the compliant crawlers. For the rest, blocking means inspecting the request itself (TLS handshake, header consistency, behavioral cadence) and making a per-request decision before the bot reaches origin.

Why blocking AI crawlers matters right now

The volume is the argument. Tollbit's Q4 2025 State of the Bots measured 9 billion AI bot scrapes across 550 billion website visits, with 2.9 billion of those bypassing robots.txt entirely. Publisher sites saw a 1-to-31 AI-bot-to-human visit ratio, up from 1-to-50 two quarters earlier. Cloudflare Radar found AI bots accessing 39% of the top million sites, while only 2.98% block them in robots.txt.

The gap between policy and enforcement is where content theft happens. A site that relies only on robots.txt is informing crawlers of its preferences, not preventing anything. The blocking conversation in 2026 is about which layer the enforcement actually lives at.

Types of blocking methods

Six methods cover the landscape, ordered from simplest to most effective.

**robots.txt.** Add directives telling specific crawlers not to visit. For example, to block GPTBot, add `User-agent: GPTBot` followed by `Disallow: /`. Takes 30 seconds. No code changes. Purely voluntary — Tollbit data shows 30% of AI scrapes ignore robots.txt entirely, and ChatGPT-User fetched 42% of sites that blocked it.
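
As a rough illustration of what a compliant crawler concludes from that entry, the sketch below feeds the two GPTBot lines to Python's standard-library `urllib.robotparser`. The permissive `User-agent: *` group and the `example.com` URL are assumptions added for the demo, not part of the original entry.

```python
from urllib.robotparser import RobotFileParser

# The two-line GPTBot entry from the text, plus an assumed
# permissive default group for every other client.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what a crawler that honors robots.txt would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Mozilla/5.0 (X11; Linux x86_64)"):
        print(agent, is_allowed(agent, "https://example.com/article"))
```

The check only models a crawler that chooses to read the file. Nothing here runs on the server, which is why the method is voluntary.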

**HTTP header checks.** Inspect the User-Agent header and reject known AI crawler signatures at the web server (Nginx, Apache) or in application code. Simple to deploy. Trivially bypassed by changing the user agent string.
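
A header check can be as small as the sketch below. The token list is illustrative and not exhaustive, and `should_block` is a hypothetical helper name; in practice the same match usually lives in an Nginx or Apache rule rather than in application code.

```python
import re

# Illustrative, non-exhaustive tokens that some AI crawlers put in
# their User-Agent header. Maintain your own list.
AI_CRAWLER_TOKENS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "PerplexityBot")
_PATTERN = re.compile("|".join(re.escape(t) for t in AI_CRAWLER_TOKENS), re.IGNORECASE)


def should_block(headers: dict) -> bool:
    """Return True when the User-Agent header names a listed crawler."""
    # Header names are case-insensitive, so normalize before the lookup.
    user_agent = next(
        (value for name, value in headers.items() if name.lower() == "user-agent"),
        "",
    )
    return bool(_PATTERN.search(user_agent))


if __name__ == "__main__":
    declared = {"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"}
    spoofed = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36"}
    print(should_block(declared), should_block(spoofed))
```

The second request in the usage example shows the weakness: the same client with a browser-like string passes untouched.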

**IP blocking.** Block IP ranges known to belong to AI companies. Some operators, OpenAI among them, publish the ranges their crawlers use. Harder to bypass than user agent checks, but published ranges change frequently, and residential proxy networks route around the block entirely.
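
A minimal range check, sketched with the standard-library `ipaddress` module. The CIDR blocks are placeholders taken from the reserved documentation ranges, not real crawler ranges; a real deployment would load and refresh the ranges from whatever source it trusts.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges.
# These are NOT real crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25", "2001:db8::/32")
]


def is_blocked(remote_addr: str) -> bool:
    """Return True when the address falls inside any blocked range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: leave the decision to other layers.
        return False
    # Membership is False across IP versions, so v4 and v6 ranges can share a list.
    return any(addr in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.15", "198.51.100.200", "2001:db8::1"):
        print(candidate, is_blocked(candidate))
```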

**Rate limiting.** Limit requests per IP or session inside a time window. Reduces volume without full blocking. Sophisticated scrapers distribute requests across thousands of IPs. Aggressive limits also hurt legitimate users.
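
One common way to implement this is a sliding-window counter keyed by IP or session, sketched below. The class name, the limits, and the in-memory store are assumptions for illustration; a production setup would typically use a shared store so that all servers see the same counts.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per key inside a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    print([limiter.allow("203.0.113.9", now=t) for t in (0.0, 1.0, 2.0, 3.0)])
```

Because the counter is per key, a scraper that spreads the same load over thousands of addresses never trips it, which is the limitation described above.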

**JavaScript challenges.** Require the visitor to execute JavaScript before serving content. Stops basic HTTP-only scrapers. Modern scraping tools (Playwright, Puppeteer, patched Chromium) render JavaScript fully. Adds latency for real users.
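
The text does not say which kind of challenge is meant, so the sketch below assumes one possible form: a small proof-of-work puzzle that the challenge page's JavaScript must solve before content is served. `solve` stands in for the browser-side script, and `SECRET_KEY`, `DIFFICULTY`, and the function names are all hypothetical.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret
DIFFICULTY = 3  # required leading zero hex digits; illustrative value


def issue_challenge(client_id: str) -> str:
    """Nonce embedded in the challenge page, bound to the client via HMAC."""
    return hmac.new(SECRET_KEY, client_id.encode(), hashlib.sha256).hexdigest()


def _meets_target(nonce: str, counter: int) -> bool:
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)


def solve(nonce: str) -> int:
    """Stand-in for the browser-side JavaScript: brute-force a counter."""
    counter = 0
    while not _meets_target(nonce, counter):
        counter += 1
    return counter


def verify(client_id: str, nonce: str, counter: int) -> bool:
    """Server-side check run before the protected content is served."""
    if not hmac.compare_digest(issue_challenge(client_id), nonce):
        return False  # nonce was issued to a different client
    return _meets_target(nonce, counter)


if __name__ == "__main__":
    nonce = issue_challenge("203.0.113.9")
    print(verify("203.0.113.9", nonce, solve(nonce)))
```

An HTTP-only client never runs the solver, so it never obtains a valid counter. A headless browser runs it like any other script, which is why this layer stops only the basic scrapers.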

**Edge-level detection.** A detection layer at the CDN or edge that analyzes every request in real time. Combines TLS fingerprinting, behavioral analysis, IP reputation, device fingerprinting, and crawler database matching. Catches crawlers regardless of user agent or IP. Sub-2ms latency. Requires a specialized provider. Centinel operates at this level, matching 1,600+ crawler signatures.
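
How such signals might be combined can be sketched as a weighted score. Everything in the block is an assumption for illustration: the fingerprint labels are made-up placeholders for JA3-style TLS hashes, the weights and threshold are arbitrary, and none of it describes how any particular provider scores traffic.

```python
from dataclasses import dataclass

# Made-up placeholder labels standing in for JA3-style TLS fingerprint hashes.
BROWSER_FINGERPRINTS = {
    "chrome": {"fp-chrome-a", "fp-chrome-b"},
    "firefox": {"fp-firefox-a"},
}
CRAWLER_FINGERPRINTS = {"fp-python-requests", "fp-headless-x"}


@dataclass(frozen=True)
class RequestSignals:
    user_agent: str
    tls_fingerprint: str
    ip_reputation: float  # 0.0 (clean) to 1.0 (known bad), hypothetical feed
    requests_last_minute: int


def score(sig: RequestSignals) -> float:
    """Combine independent signals into a 0..1 suspicion score."""
    total = 0.0
    if sig.tls_fingerprint in CRAWLER_FINGERPRINTS:
        total += 0.6
    user_agent = sig.user_agent.lower()
    for family, fingerprints in BROWSER_FINGERPRINTS.items():
        # The header claims a browser family but the handshake disagrees.
        if family in user_agent and sig.tls_fingerprint not in fingerprints:
            total += 0.4
            break
    total += 0.3 * sig.ip_reputation
    if sig.requests_last_minute > 120:
        total += 0.2
    return min(total, 1.0)


def decide(sig: RequestSignals, threshold: float = 0.5) -> str:
    return "block" if score(sig) >= threshold else "allow"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36"
    print(decide(RequestSignals(ua, "fp-python-requests", 0.1, 10)))
    print(decide(RequestSignals(ua, "fp-chrome-a", 0.0, 5)))
```

The first request in the usage example carries a browser user agent but a scripted client's handshake, so it is blocked even though a header check alone would pass it.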

How crawler blocking works

The six methods are not substitutes. They are layers, and each catches a different class of bot.

robots.txt filters the honest crawlers that read the file and leave. Header and IP checks filter the trivially lazy bots that identify themselves. Rate limiting filters the noisy scrapers that hit a single origin too hard. JavaScript challenges filter the HTTP-only libraries that cannot execute code. Edge-level detection filters everything that survives the previous five by inspecting low-level signals the bot cannot spoof cheaply.

A layered defense is cheaper and more effective than any single method pushed to its limit. The bot that beats your user-agent check may not beat your TLS fingerprint check. The one that spoofs TLS may still fail the behavioral check on the second page. Every layer raises the cost of evasion.
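
The layering idea can be sketched as an ordered list of checks in which the first one that fires decides the request. The request shape, the field names, the placeholder fingerprint label, and the thresholds are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each layer is a (name, check) pair; a check returns True to block.
# Field names and values are hypothetical.
LAYERS: List[Tuple[str, Callable[[dict], bool]]] = [
    ("user-agent", lambda r: "gptbot" in r.get("user_agent", "").lower()),
    ("tls-fingerprint", lambda r: r.get("tls_fingerprint") in {"fp-python-requests"}),
    ("behavior", lambda r: r.get("pages_per_minute", 0) > 60),
]


def first_blocking_layer(request: dict) -> Optional[str]:
    """Return the name of the first layer that blocks, or None to allow."""
    for name, check in LAYERS:
        if check(request):
            return name
    return None


if __name__ == "__main__":
    spoofed_ua = {"user_agent": "Mozilla/5.0 Chrome/124.0", "tls_fingerprint": "fp-python-requests"}
    spoofed_tls = {
        "user_agent": "Mozilla/5.0 Chrome/124.0",
        "tls_fingerprint": "fp-chrome-a",
        "pages_per_minute": 300,
    }
    print(first_blocking_layer(spoofed_ua), first_blocking_layer(spoofed_tls))
```

Each evasion in the usage example defeats one layer and is caught by the next, so an evader has to beat every layer in the same session.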

How to identify which method fits your site

Start with robots.txt as a baseline. It costs nothing and handles well-behaved crawlers. Pair it with header and IP checks at the server layer — cheap to add, covers the lazy 20%.

For any site where AI scraping is a material problem (publishers, e-commerce with proprietary catalogs, SaaS dashboards with valuable screens), add rate limiting and challenge pages on high-value routes. For real protection at volume, edge-level detection is the layer that enforces rather than informs.

The decision is driven by what is on the page. A marketing site loses little from being scraped. A paywalled publisher loses the business model. The higher the content value, the deeper in the stack the enforcement has to live.

How to respond when a method gets bypassed

Every blocking method has a countermeasure. The response is not to swap one method for another — it is to layer detection so no single bypass wins the session.

Monitor the bypass signal. A spike in traffic from a specific user agent, ASN, or TLS fingerprint after a block went live tells you the rule was registered and routed around. Escalate one layer down: from UA filtering to IP blocking, from IP blocking to TLS fingerprinting, from TLS to behavioral analysis. Refresh fingerprint databases on a cadence — published crawler signatures drift, new ones emerge, and stale databases miss traffic that a current one would flag.
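
One way to surface the bypass signal is to compare per-key request counts in equal-length windows before and after a rule goes live. The sketch below assumes the keys (user agents, ASNs, or TLS fingerprints) have already been extracted from logs; the function name and thresholds are hypothetical, and the ASNs in the usage example come from the reserved documentation range.

```python
from collections import Counter
from typing import Iterable, List


def bypass_candidates(
    before: Iterable[str],
    after: Iterable[str],
    min_hits: int = 100,
    growth: float = 3.0,
) -> List[str]:
    """Flag keys whose traffic spiked after a block went live.

    `before` and `after` hold one key per request (a user agent, ASN,
    or TLS fingerprint) from two windows of equal length.
    """
    baseline, current = Counter(before), Counter(after)
    flagged = []
    for key, hits in current.items():
        # Treat unseen keys as a baseline of 1 so new sources can be flagged.
        if hits >= min_hits and hits >= growth * max(baseline.get(key, 0), 1):
            flagged.append(key)
    return sorted(flagged)


if __name__ == "__main__":
    before = ["AS64500"] * 500 + ["AS64501"] * 40
    after = ["AS64500"] * 20 + ["AS64501"] * 400 + ["AS64502"] * 50
    print(bypass_candidates(before, after))
```

In the usage example, traffic from the blocked ASN collapses while a neighbor grows tenfold, which is the pattern that suggests the rule was noticed and routed around.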

Key takeaways

- robots.txt handles the honest crawlers and nothing else. Tollbit Q4 2025 measured 30% of AI scrapes ignoring it, and ChatGPT-User bypassed 42% of sites that blocked it.
- Six methods cover the landscape: robots.txt, header checks, IP blocks, rate limits, JavaScript challenges, and edge-level detection. Each catches a different class of bot.
- Layered defense beats any single method pushed to its limit. Edge-level detection at the CDN is the only layer that enforces rather than informs.
- Response is a cadence, not a one-time setting. Monitor bypass signals, refresh fingerprint databases, escalate one layer down when a method gets routed around.
