Common Crawl Is an AI Training Pipeline. Publishers Are Done Pretending Otherwise.
May 5, 2026
Key Points
- The News/Media Alliance formally demanded that Common Crawl remove publisher content, revise its terms of use, and explicitly prohibit AI training use of its archive.
- Over 60% of Common Crawl's 2024 donated funds came from entities directly affiliated with generative AI companies, including Anthropic, OpenAI, and the Schmidt Foundation.
- Common Crawl's opt-out registry is buried in a footnote and carries no enforceable directive against AI training use.
- Publishers blocking AI crawlers via robots.txt faced a 23.1% monthly visit decline in early 2026, with no corresponding reduction in AI citations.
- Whatever you decide about blocking, the traffic you still have needs to be working harder than ever.
What Happened
The News/Media Alliance sent a formal demand letter to Common Crawl on April 29, 2026. The letter, first reported by Bloomberg and covered in depth by PPC Land, demands four specific actions from the nonprofit web archive organization.
The NMA is asking Common Crawl to remove publisher content upon request, publish a clear statement that it does not own or authorize use of scraped content, revise its terms of use to explicitly prohibit AI training use, and add enforceable warnings to its opt-out registry. Exhibit A of the letter includes hundreds of domain names from publishers ranging from NBCUniversal and CNN to McClatchy, Vox Media, Ziff Davis, USA Today, and dozens of regional outlets.
See It In Action:
- WaPo Cuts 300 Staff as AI Search Erodes Publisher Traffic: A real-world case study in what AI-driven traffic erosion looks like at scale for a major news publisher.
- Agency Revenue Drops Signal AI Search Traffic Shift: How agency revenue data is revealing the downstream impact of AI search on publisher monetization.
- News Publishers Ad Revenue Resource Center: Monetization strategy, tools, and guidance built specifically for news publishers under traffic pressure.
Why This Matters More Than Another Open Letter
Common Crawl is not just a passive archive. It is infrastructure. OpenAI used it to train GPT-3 in 2020. Google used the C4 subset to develop what became Bard. GPT-3.5, built on Common Crawl data, became the foundation for ChatGPT. The NMA's letter cites these connections explicitly, using the AI companies' own research papers as documentation.
The funding picture makes the "neutral nonprofit" framing hard to sustain. According to the NMA's letter, drawing on Common Crawl's Form 990 filings published on ProPublica, over 60% of donated funds in 2024 came from entities directly or closely affiliated with generative AI companies or data brokers. More than half of those donations came from three sources: Anthropic, OpenAI, and the Schmidt Foundation.
The indemnity clauses buried in Common Crawl's terms of use are the most telling detail. Those terms already cover the use of crawled content for "developing, training, or deploying AI Systems" and include liability shields for "infringement or misappropriation of any third party's patent, trademark, copyright." The NMA's letter makes the obvious point: you don't write liability protection for something you don't expect to happen.
Essential Background Reading:
- AI Crawler Resource Center for Publishers: The full hub for understanding how AI crawlers work, what they're taking, and what publishers can do about it.
- AI Scraping vs. Traditional SEO Crawling: Why AI scrapers and search crawlers aren't the same thing, and why that distinction matters for your robots.txt decisions.
- Ad Tech Crawlers You Should Never Block: A guide to distinguishing revenue-generating bots from content-consuming ones before you start blocking.
- Playwire on Generative AI: Our position on generative AI and what it means for publishers partnering with us.
The Opt-Out Problem Is Structural, Not Administrative
Common Crawl does have an opt-out registry. Publishers can request exclusion from future crawls. The NMA acknowledges this, then explains exactly why it doesn't work.
The registry is listed as one of 27 subsections at the bottom of Common Crawl's homepage. It contains no directive to developers using the archive, no prohibition on AI training use, and no enforcement mechanism. Some NMA members submitted removal requests more than two and a half years before this letter was filed. Their content remains in the archive.
The robots.txt path has its own failure mode. Common Crawl states it will honor properly configured robots.txt files, but the archive already contains years of content scraped before those directives existed. A publisher who blocked Common Crawl's crawler in 2024 did not un-scrape their 2019 journalism.
Here is where the situation stands for publishers weighing their options on blocking AI crawlers:
| Approach | Covers Future Crawls | Removes Existing Archive Content | Carries Enforcement Risk |
|---|---|---|---|
| robots.txt blocking | Yes, if honored | No | None |
| Common Crawl opt-out registry | Partially | Not reliably | None |
| NMA formal demand | Requested | Requested | Uncertain |
| Legal action | Depends on outcome | Depends on outcome | High |
No single mechanism closes the historical gap. That's what the NMA is actually trying to force Common Crawl to address.
Related Content:
- AI Training vs. AI Search Crawlers: Does blocking AI training crawlers actually hurt your referral traffic from AI search tools? The data-driven answer.
- How AI Crawlers Impact Entertainment Website Traffic and Ad Revenue: Vertical-specific analysis of crawler impact for entertainment publishers.
- AI Crawler Impact on Lifestyle Publisher Traffic: What the traffic data actually shows for lifestyle publishers navigating the crawler problem.
- Big Tech's AI Licensing Report Card: Where the major AI licensing deals stand and what publishers should actually do in response.
What Publishers Should Do Right Now
The legal and policy fight will play out on its own timeline. Publishers need to make practical decisions now.
A few things worth doing regardless of how the NMA's demands are received:
- Check the opt-out registry: Confirm your domains are listed. The NMA's Exhibit A is a model. If you haven't formally submitted a removal request, do it now so the record exists.
- Audit your robots.txt: Make sure CCBot is explicitly disallowed if that's your position. Common Crawl's crawler user agent is CCBot. This won't remove historical content, but it stops future scraping.
- Document your removal requests: Dates, methods, and any responses. The legal standing problem courts have identified requires demonstrable harm. A paper trail matters.
- Evaluate the blocking tradeoff carefully: Research published in early 2026 found that publishers who blocked AI crawlers via robots.txt saw a 23.1% monthly visit decline. Blocking costs traffic. Not blocking costs content. There's no clean answer.
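For publishers auditing the robots.txt step above, here is a minimal sketch using Python's standard `urllib.robotparser` to verify that a given robots.txt actually disallows CCBot site-wide. The `CCBot` user agent string comes from the article; the helper function name is ours.

```python
from urllib.robotparser import RobotFileParser

def ccbot_blocked(robots_txt: str) -> bool:
    """Return True if this robots.txt text disallows CCBot site-wide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() is False when the named agent is disallowed for the path
    return not parser.can_fetch("CCBot", "/")

# A robots.txt that blocks Common Crawl's crawler while leaving others alone
rules = """User-agent: CCBot
Disallow: /
"""
print(ccbot_blocked(rules))  # True: CCBot is blocked site-wide
```

In practice you would fetch your live robots.txt and run the same check, rather than testing a local string; the point is to confirm the directive as deployed, not as intended.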
If you want to assess where your crawler protection stands right now, our AI Crawler Protection Grader gives you a fast, technical read on your current configuration. The AI Crawler Resource Center has the implementation details if you want to go deeper.
Next Steps:
- AI Crawler Protection Grader: Get a fast technical read on how well your current robots.txt and blocking configuration actually protects your content.
- The AI Search Reckoning: Why publishers need a coherent strategy rather than reactive blocking decisions.
- AI Traffic Is the New SEO: How to position your content strategy to capture referral traffic from AI search tools.
- AI and Publishers Resource Center: The broader strategy hub covering everything from crawler decisions to monetization in an AI-driven traffic environment.
The Traffic You Have Still Needs to Work
Publishers are losing ground on two fronts simultaneously. AI companies are consuming content without sending traffic back. Cloudflare data from August 2025 showed Anthropic crawling 38,000 pages for every single referred visit. Training-related crawling accounted for nearly 80% of all AI bot activity.
Blocking doesn't fully solve that, either. The 23.1% monthly visit decline among publishers who blocked AI crawlers means the traffic pool is shrinking regardless of which path you choose. Agency revenue data signals that this AI search traffic shift is already hitting publisher bottom lines.
That makes yield optimization on your remaining traffic more important, not less. Every session carries more weight when total sessions are under pressure. RPS matters more when volume is declining. If your monetization stack isn't pulling maximum value from the traffic you do have, you're compounding the problem. Publishers navigating this shift are optimizing for AI referrals as the new SEO while shoring up yield on existing sessions.
We work with publishers across gaming, news, entertainment, and education to keep yield from slipping as traffic tightens. The conversation about what your current inventory is actually worth is worth having now, not after another quarter of traffic erosion. Talk to us.
