What is web scraping?

The mechanics of web scraping, why companies do it, the legal landscape, and how AI has changed the scraping game.

Web scraping is the automated extraction of data from websites. A scraper sends requests to a web server, receives the HTML response, and parses out the specific data it needs: product prices, article text, inventory counts, editorial content.

The technique is as old as the commercial web. What has changed in 2026 is who scrapes, at what scale, and what they do with the result.

Why web scraping matters right now

Web scraping serves many purposes: price comparison, research, recruiting, competitive analysis, and, increasingly, AI companies collecting training data from across the web.

The AI shift changed the math. Traditional scraping targets specific data points from specific sites. AI scraping is different in scale and purpose: model companies need massive volumes of diverse text, so they scrape broadly, deeply, and continuously, returning to the same sites repeatedly.

Tollbit’s Q4 2025 State of the Bots report found that, of 550 billion website visits analyzed, 9 billion were AI bot scrapes, and 2.9 billion of those bypassed robots.txt. On publisher sites, Tollbit measured one AI bot visit for every 31 human visits, up from one for every 50 two quarters earlier.
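To put those figures in proportion, the arithmetic can be worked through directly. This is only a restatement of the numbers quoted above; the variable names are illustrative.

```python
# Figures as quoted from Tollbit's Q4 2025 report.
total_visits = 550e9
ai_scrapes = 9e9
bypassed_robots = 2.9e9

ai_share = ai_scrapes / total_visits          # fraction of all visits that were AI bot scrapes
bypass_share = bypassed_robots / ai_scrapes   # fraction of AI scrapes that bypassed robots.txt

# A 1-to-31 ratio means one AI bot visit per 31 human visits.
bots_per_100_humans_now = 100 / 31
bots_per_100_humans_before = 100 / 50

print(f"AI share of all visits: {ai_share:.1%}")
print(f"AI scrapes bypassing robots.txt: {bypass_share:.1%}")
print(f"AI bot visits per 100 human visits: "
      f"{bots_per_100_humans_before:.1f} -> {bots_per_100_humans_now:.1f}")
```

Roughly a third of the measured AI scrapes ignored robots.txt, and the publisher-site ratio rose from about 2 to about 3.2 bot visits per 100 human visits.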

The legal picture is contested and jurisdiction-dependent. In the US, the Computer Fraud and Abuse Act and copyright law offer some protection, but enforcement is inconsistent. The EU’s Database Directive protects structured data more firmly. Lawsuits against OpenAI and Anthropic are testing whether AI training counts as fair use. Law is catching up; infrastructure has to hold the line in the meantime.

Types of web scraping

Four kinds of scraper traffic show up on a production site, with different technical signatures and different commercial intent.

**Personal and research tools.** A developer running BeautifulSoup or Scrapy. A journalist pulling FOIA data. Small scale, identifiable user agents, usually easy to rate-limit.

**Commercial scraping services.** Bright Data, Oxylabs, ScraperAPI, and others. They offer rotating residential proxies, browser automation, and CAPTCHA solving. Detection is significantly harder because each request appears to come from a different home ISP.

**AI training crawlers.** GPTBot, ClaudeBot, Google-Extended, Bytespider, Applebot-Extended, CCBot. These mostly identify themselves and mostly respect robots.txt; the long tail of unlabeled training crawlers is where the volume is.

**Adversarial scrapers.** These target paywalled content, competitive pricing, or proprietary data. They use patched Chromium builds, curl-impersonate, or custom TLS libraries to reproduce a real browser’s handshake byte-for-byte.
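For the self-identifying AI training crawlers listed above, robots.txt is the usual first control. The sketch below uses Python's standard `urllib.robotparser` to check how a policy would be read by a compliant crawler. The policy text and the example.com URL are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: disallow two training crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
    allowed = robots.can_fetch(agent, "https://example.com/articles/1")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

This only describes what a well-behaved crawler would do. Nothing in robots.txt is enforced, which is why the unlabeled and adversarial classes need the detection layers covered later.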

How web scraping works

A scraper is a program that downloads web pages and extracts data from the HTML. Modern scrapers render JavaScript, solve CAPTCHAs, rotate through proxy networks, and mimic real browser behavior down to mouse movements and scroll patterns.
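The extraction half of that loop can be sketched with nothing but the standard library. The markup below is a made-up product listing, and the `price` class name is an assumption; a real scraper would first fetch the page over HTTP and would usually reach for a library such as BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical page markup standing in for a fetched HTTP response body.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">$4.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

price_parser = PriceParser()
price_parser.feed(SAMPLE_HTML)
print(price_parser.prices)
```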

The request itself is HTTP. What the scraper controls is the user agent, TLS handshake, HTTP/2 settings, cookies, and timing. What it cannot fully control is the consistency across those layers. A Python HTTP library claiming to be Chrome sends a TLS fingerprint that reveals the lie in its ClientHello, before any HTTP is exchanged. A headless browser driving a scripted workflow has a behavioral rhythm (steady, deterministic, without the pauses a human makes) that separates it from a reader.
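A minimal sketch of that cross-layer check follows. It assumes the TLS fingerprint has already been computed upstream (for example as a JA3-style hash); the fingerprint values, the lookup table, and the function names are all placeholders invented for illustration, not real signatures.

```python
# Hypothetical table: TLS fingerprint hash -> client family that produces it.
# These hashes are placeholders, not real fingerprint values.
KNOWN_TLS_FINGERPRINTS = {
    "aaaa1111": "chrome",
    "bbbb2222": "firefox",
    "cccc3333": "python-requests",
}

def claimed_family(user_agent):
    """Very rough user-agent classification, for illustration only."""
    ua = user_agent.lower()
    if "python-requests" in ua:
        return "python-requests"
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return "unknown"

def check_consistency(user_agent, tls_fingerprint):
    """Compare what the user agent claims with what the TLS layer shows."""
    observed = KNOWN_TLS_FINGERPRINTS.get(tls_fingerprint)
    claimed = claimed_family(user_agent)
    if observed is None or claimed == "unknown":
        return "unknown"
    return "consistent" if observed == claimed else "mismatch"

CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

print(check_consistency(CHROME_UA, "aaaa1111"))  # Chrome UA, Chrome handshake
print(check_consistency(CHROME_UA, "cccc3333"))  # Chrome UA, Python library handshake
```

The second call is the case described above: the header says Chrome while the handshake belongs to a Python library.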

How to identify scraping on your site

Four signals tell you scraping is happening. None is decisive on its own; together they are.

Traffic anomalies show up first: a sudden rise in requests to product or article pages, especially outside business hours; a spike in 429 or 403 responses correlating with a drop in origin cache hit rate.

User agent honesty is the next check. A Chrome user agent from a cloud ASN is suspicious. A request claiming to be GPTBot from an IP outside OpenAI’s published ranges is an impersonator.
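The impersonation check reduces to a CIDR membership test. In this sketch the range is a placeholder from the RFC 5737 documentation block, not OpenAI's actual list; in practice the ranges would be loaded from whatever the crawler operator publishes. The function name is hypothetical.

```python
import ipaddress

# Placeholder range from the RFC 5737 documentation block.
# Replace with the ranges the crawler operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
}

def is_impersonator(user_agent, client_ip):
    """True if the UA names a known crawler but the IP is outside its ranges."""
    ip = ipaddress.ip_address(client_ip)
    for crawler, networks in PUBLISHED_RANGES.items():
        if crawler.lower() in user_agent.lower():
            return not any(ip in net for net in networks)
    return False  # not claiming to be a crawler we have ranges for

ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"
print(is_impersonator(ua, "192.0.2.10"))   # inside the placeholder range
print(is_impersonator(ua, "203.0.113.7"))  # outside it
```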

Rate patterns separate automation from readers. Humans dwell, scroll, and pause. Automation hits pages at a steady cadence or in synchronized bursts. Revisit intervals that align too cleanly with a clock (every 6 hours, every 24 hours) point to a scheduler, not a reader.
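One simple way to quantify "too clean" is the coefficient of variation of the gaps between requests: near zero for a scheduler, much larger for a person. The sketch below assumes sorted timestamps in seconds; the sample sessions are invented, and any cutoff between the two would need tuning against real traffic.

```python
from statistics import mean, pstdev

def cadence_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        raise ValueError("need at least three requests with a non-zero mean gap")
    return pstdev(gaps) / mean(gaps)

scheduler = [0, 21600, 43200, 64800, 86400]  # a revisit every 6 hours
reader = [0, 40, 45, 310, 2900]              # made-up human-like session

print(f"scheduler: {cadence_variation(scheduler):.2f}")
print(f"reader:    {cadence_variation(reader):.2f}")
```

A perfectly regular schedule scores exactly zero. Real schedulers add jitter, so a low value is a hint to combine with the other signals rather than a verdict.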

Origin hops close it out. A session that touches twenty pages from twenty different ISPs in the same city is one scraper wearing twenty masks.
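Counting distinct networks per session is enough to surface this pattern. The sketch assumes each log record already carries a session identifier and a network label (an ISP name or ASN resolved upstream); the threshold of three and the sample data are arbitrary illustrations.

```python
from collections import defaultdict

def networks_per_session(requests):
    """Map session id -> set of distinct network labels seen for it."""
    seen = defaultdict(set)
    for session_id, network in requests:
        seen[session_id].add(network)
    return seen

def hopping_sessions(requests, max_networks=3):
    """Sessions observed from more than max_networks distinct networks."""
    return {
        session_id
        for session_id, networks in networks_per_session(requests).items()
        if len(networks) > max_networks
    }

# Hypothetical log: one session spread across twenty ISPs, one from a single ISP.
log = [("s1", f"isp-{i}") for i in range(20)] + [("s2", "isp-0")] * 5

print(hopping_sessions(log))
```

A legitimate reader can change networks once or twice (phone to home Wi-Fi, for instance), which is why the threshold is not set at one.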

How to prevent unwanted scraping

Effective anti-scraping requires multiple layers: rate limiting, IP reputation, TLS fingerprinting, behavioral analysis, and crawler identification. No single technique is sufficient because scrapers adapt. The goal is not to block every request; it is to make scraping expensive enough that attackers move to easier targets.
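Rate limiting is the simplest of these layers to illustrate. A token bucket is one common approach (chosen here as an example, not prescribed by anything above): each client gets a burst allowance that refills at a fixed rate. The class below takes the current time as an argument so the behavior is easy to follow; a deployment would pass something like `time.monotonic()` and keep one bucket per client key.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)

# Eight requests at the same instant: the burst allowance covers the first five.
burst = [bucket.allow(now=0.0) for _ in range(8)]
print(burst)

# Two seconds later, two tokens have been refilled.
print(bucket.allow(now=2.0))
```

On its own this only slows a single source. Against rotating residential proxies each IP stays under the limit, which is why the other layers are needed.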

Start at the edge, before the request reaches origin. Match the TLS fingerprint and HTTP/2 SETTINGS frame against known library signatures. Check the user agent for internal consistency. For the long tail, look up the signals against a fingerprint database (Centinel tracks 1,600+) and decide per-request whether to block, challenge, or watchlist.
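The per-request decision can be sketched as a small policy function over those signals. Everything here is an assumption made for illustration: the field names, the ordering of the rules, and the `allow` default are not taken from any particular product.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    matches_library_signature: bool  # TLS / HTTP/2 fingerprint matches an automation library
    user_agent_consistent: bool      # user agent agrees with the other layers
    in_fingerprint_database: bool    # signals found in a bot fingerprint database

def decide(s):
    """Hypothetical policy: strongest evidence first, mildest action last."""
    if s.matches_library_signature and not s.user_agent_consistent:
        return "block"       # library handshake hiding behind a browser user agent
    if s.in_fingerprint_database:
        return "challenge"   # known automation, verify before serving
    if not s.user_agent_consistent:
        return "watchlist"   # odd but not conclusive, keep observing
    return "allow"

print(decide(Signals(True, False, False)))
print(decide(Signals(False, True, True)))
print(decide(Signals(False, False, False)))
print(decide(Signals(False, True, False)))
```

Keeping the policy in one function makes it straightforward to adjust which combinations lead to a block versus a challenge as scrapers adapt.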

Pair this with legal posture. A public scraping policy, a clear access policy for training crawlers, and an explicit terms-of-service clause give the technical layer something to point at when a scraper escalates.

Key takeaways

- Web scraping in 2026 is dominated by AI training and retrieval crawlers, not classic price scrapers: Tollbit measured 9 billion AI bot scrapes across 550 billion visits, with 2.9 billion bypassing robots.txt.
- Four classes of scraper each need different handling: personal tools, commercial services, AI training crawlers, and adversarial scrapers.
- Detection is a cross-layer problem: user agents, TLS fingerprints, HTTP/2 settings, and behavioral patterns together identify what any single signal misses.
- Prevention is economic, not absolute. Make scraping more expensive than the data is worth and attackers move to softer targets.
