Common Crawl Is an AI Training Pipeline. Publishers Are Done Pretending Otherwise.

May 5, 2026

Key Points

  • The News/Media Alliance formally demanded that Common Crawl remove publisher content on request, disclaim ownership of scraped content, revise its terms of use to prohibit AI training, and add enforceable warnings to its opt-out registry.
  • Over 60% of Common Crawl's 2024 donated funds came from entities directly affiliated with generative AI companies, including Anthropic, OpenAI, and the Schmidt Foundation.
  • Common Crawl's opt-out registry is buried at the bottom of its homepage and carries no enforceable directive against AI training use.
  • Publishers blocking AI crawlers via robots.txt faced a 23.1% monthly visit decline in early 2026, with no corresponding reduction in AI citations.
  • Whatever you decide about blocking, the traffic you still have needs to be working harder than ever.

What Happened

The News/Media Alliance sent a formal demand letter to Common Crawl on April 29, 2026. The letter, first reported by Bloomberg and covered in depth by PPC Land, demands four specific actions from the nonprofit web archive organization.

The NMA is asking Common Crawl to remove publisher content upon request, publish a clear statement that it does not own or authorize use of scraped content, revise its terms of use to explicitly prohibit AI training use, and add enforceable warnings to its opt-out registry. Exhibit A of the letter includes hundreds of domain names from publishers ranging from NBCUniversal and CNN to McClatchy, Vox Media, Ziff Davis, USA Today, and dozens of regional outlets.

Why This Matters More Than Another Open Letter

Common Crawl is not just a passive archive. It is infrastructure. OpenAI used it to train GPT-3 in 2020. Google used the C4 subset to develop what became Bard. GPT-3.5, built on Common Crawl data, became the foundation for ChatGPT. The NMA's letter cites these connections explicitly, using the AI companies' own research papers as documentation.

The funding picture makes the "neutral nonprofit" framing hard to sustain. According to the NMA's letter, drawing on Common Crawl's Form 990 filings published on ProPublica, over 60% of donated funds in 2024 came from entities directly or closely affiliated with generative AI companies or data brokers. More than half of those donations came from three sources: Anthropic, OpenAI, and the Schmidt Foundation.

The indemnity clauses buried in Common Crawl's terms of use are the most telling detail. Those terms already cover the use of crawled content for "developing, training, or deploying AI Systems" and include liability shields for "infringement or misappropriation of any third party's patent, trademark, copyright." The NMA's letter makes the obvious point: you don't write liability protection for something you don't expect to happen.

The Opt-Out Problem Is Structural, Not Administrative

Common Crawl does have an opt-out registry. Publishers can request exclusion from future crawls. The NMA acknowledges this, then explains exactly why it doesn't work.

The registry is listed as one of 27 subsections at the bottom of Common Crawl's homepage. It contains no directive to developers using the archive, no prohibition on AI training use, and no enforcement mechanism. Some NMA members submitted removal requests more than two and a half years before this letter was filed. Their content remains in the archive.

The robots.txt path has its own failure mode. Common Crawl states it will honor properly configured robots.txt files, but the archive already contains years of content scraped before those directives existed. A publisher who blocked Common Crawl's crawler in 2024 did not un-scrape their 2019 journalism.
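
For publishers putting that directive in place now, the entry itself is short. Below is a minimal sketch of a robots.txt block for Common Crawl's crawler, which identifies itself as CCBot; disallowing the root path is one common choice, not the only one.

```
# Block Common Crawl's crawler (user agent: CCBot) from all paths.
# This affects future crawls only; content already in the archive
# stays there.
User-agent: CCBot
Disallow: /
```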

Here is where the situation stands for publishers weighing their options on blocking AI crawlers:

| Approach | Covers Future Crawls | Removes Existing Archive Content | Carries Enforcement Risk |
| --- | --- | --- | --- |
| robots.txt blocking | Yes, if honored | No | None |
| Common Crawl opt-out registry | Partially | Not reliably | None |
| NMA formal demand | Requested | Requested | Uncertain |
| Legal action | Depends on outcome | Depends on outcome | High |

No single mechanism closes the historical gap. That's what the NMA is actually trying to force Common Crawl to address.

What Publishers Should Do Right Now

The legal and policy fight will play out on its own timeline. Publishers need to make practical decisions now.

A few things worth doing regardless of how the NMA's demands are received:

  • Check the opt-out registry: Confirm your domains are listed; the NMA's Exhibit A is a model for how to enumerate them. If you haven't formally submitted a removal request, do it now so the record exists.
  • Audit your robots.txt: Common Crawl's crawler identifies itself as CCBot; make sure it is explicitly disallowed if blocking is your position. This won't remove historical content, but it stops future scraping. A verification sketch follows this list.
  • Document your removal requests: Record dates, methods, and any responses. Courts have tied standing to demonstrable harm, and a paper trail is how you show it.
  • Evaluate the blocking tradeoff carefully: Research published in early 2026 found that publishers who blocked AI crawlers via robots.txt saw a 23.1% monthly visit decline. Blocking costs traffic. Not blocking costs content. There's no clean answer.
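
If you manage more than a handful of domains, it's worth verifying that the CCBot directive is actually live rather than assuming it. Here is a minimal sketch using Python's standard-library robots.txt parser; example.com is a placeholder for your own properties, and ccbot_blocked is an illustrative helper name, not an existing tool.

```python
# Verify that a site's live robots.txt disallows CCBot at the root.
# Minimal sketch using only the Python standard library.
from urllib.robotparser import RobotFileParser

def ccbot_blocked(domain: str) -> bool:
    """Return True if robots.txt denies CCBot access to the site root."""
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return not parser.can_fetch("CCBot", f"https://{domain}/")

if __name__ == "__main__":
    for domain in ["example.com"]:  # replace with your domains
        status = "blocked" if ccbot_blocked(domain) else "NOT blocked"
        print(f"{domain}: CCBot {status}")
```

One caveat: robotparser treats a missing robots.txt (an HTTP 404) as allow-all, so a domain with no robots.txt at all will report CCBot as not blocked, which is the correct answer.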

If you want to assess where your crawler protection stands right now, our AI Crawler Protection Grader gives you a fast, technical read on your current configuration. The AI Crawler Resource Center has the implementation details if you want to go deeper.

The Traffic You Have Still Needs to Work

Publishers are losing ground on two fronts simultaneously. AI companies are consuming content without sending traffic back. Cloudflare data from August 2025 showed Anthropic crawling 38,000 pages for every single referred visit. Training-related crawling accounted for nearly 80% of all AI bot activity.

Blocking doesn't fully solve that, either. The 23.1% monthly visit decline among publishers who blocked AI crawlers means the traffic pool is shrinking regardless of which path you choose, and agency revenue data signals that the AI search traffic shift is already hitting publisher bottom lines.

That makes yield optimization on your remaining traffic more important, not less. Every session carries more weight when total sessions are under pressure, and revenue per session (RPS) matters more when volume is declining. If your monetization stack isn't pulling maximum value from the traffic you do have, you're compounding the problem. Publishers navigating this shift are optimizing for AI referrals as the new SEO while shoring up yield on existing sessions.

We work with publishers across gaming, news, entertainment, and education to make sure that doesn't happen. A conversation about what your current inventory is actually worth should happen now, not after another quarter of traffic erosion. Talk to us.
