Common Crawl Is an AI Training Pipeline. Publishers Are Done Pretending Otherwise.

May 5, 2026

Key Points

  • The News/Media Alliance formally demanded that Common Crawl remove publisher content on request, disclaim ownership of scraped content, revise its terms of use to prohibit AI training, and add enforceable warnings to its opt-out registry.
  • Over 60% of Common Crawl's 2024 donated funds came from entities directly affiliated with generative AI companies, including Anthropic, OpenAI, and the Schmidt Foundation.
  • Common Crawl's opt-out registry is buried at the bottom of its homepage and carries no enforceable directive against AI training use.
  • Publishers blocking AI crawlers via robots.txt faced a 23.1% monthly visit decline in early 2026, with no corresponding reduction in AI citations.
  • Whatever you decide about blocking, the traffic you still have needs to be working harder than ever.

What Happened

The News/Media Alliance sent a formal demand letter to Common Crawl on April 29, 2026. The letter, first reported by Bloomberg and covered in depth by PPC Land, demands four specific actions from the nonprofit web archive organization.

The NMA is asking Common Crawl to remove publisher content upon request, publish a clear statement that it does not own or authorize use of scraped content, revise its terms of use to explicitly prohibit AI training use, and add enforceable warnings to its opt-out registry. Exhibit A of the letter includes hundreds of domain names from publishers ranging from NBCUniversal and CNN to McClatchy, Vox Media, Ziff Davis, USA Today, and dozens of regional outlets.

Why This Matters More Than Another Open Letter

Common Crawl is not just a passive archive. It is infrastructure. OpenAI used it to train GPT-3 in 2020. Google used the C4 subset to develop what became Bard. GPT-3.5, built on Common Crawl data, became the foundation for ChatGPT. The NMA's letter cites these connections explicitly, using the AI companies' own research papers as documentation.

The funding picture makes the "neutral nonprofit" framing hard to sustain. According to the NMA's letter, drawing on Common Crawl's Form 990 filings published on ProPublica, over 60% of donated funds in 2024 came from entities directly or closely affiliated with generative AI companies or data brokers. More than half of those donations came from three sources: Anthropic, OpenAI, and the Schmidt Foundation.

The indemnity clauses buried in Common Crawl's terms of use are the most telling detail. Those terms already cover the use of crawled content for "developing, training, or deploying AI Systems" and include liability shields for "infringement or misappropriation of any third party's patent, trademark, copyright." The NMA's letter makes the obvious point: you don't write liability protection for something you don't expect to happen.

The Opt-Out Problem Is Structural, Not Administrative

Common Crawl does have an opt-out registry. Publishers can request exclusion from future crawls. The NMA acknowledges this, then explains exactly why it doesn't work.

The registry is listed as one of 27 subsections at the bottom of Common Crawl's homepage. It contains no directive to developers using the archive, no prohibition on AI training use, and no enforcement mechanism. Some NMA members submitted removal requests more than two and a half years before this letter was filed. Their content remains in the archive.

The robots.txt path has its own failure mode. Common Crawl states it will honor properly configured robots.txt files, but the archive already contains years of content scraped before those directives existed. A publisher who blocked Common Crawl's crawler in 2024 did not un-scrape their 2019 journalism.
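
For publishers putting that directive in place now, the entry itself is short. Below is a minimal sketch of a robots.txt block for Common Crawl's crawler, which identifies itself as CCBot; disallowing the root path is one common choice, not the only one.

```
# Block Common Crawl's crawler (user agent: CCBot) from all paths.
# This affects future crawls only; content already in the archive
# stays there.
User-agent: CCBot
Disallow: /
```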

Here is where the situation stands for publishers weighing their options on blocking AI crawlers:

| Approach | Covers Future Crawls | Removes Existing Archive Content | Carries Enforcement Risk |
| --- | --- | --- | --- |
| robots.txt blocking | Yes, if honored | No | None |
| Common Crawl opt-out registry | Partially | Not reliably | None |
| NMA formal demand | Requested | Requested | Uncertain |
| Legal action | Depends on outcome | Depends on outcome | High |

No single mechanism closes the historical gap. That's what the NMA is actually trying to force Common Crawl to address.

What Publishers Should Do Right Now

The legal and policy fight will play out on its own timeline. Publishers need to make practical decisions now.

A few things worth doing regardless of how the NMA's demands are received:

  • Check the opt-out registry: Confirm your domains are listed; the NMA's Exhibit A is a model for how to enumerate them. If you haven't formally submitted a removal request, do it now so the record exists.
  • Audit your robots.txt: Common Crawl's crawler identifies itself as CCBot; make sure it is explicitly disallowed if blocking is your position. This won't remove historical content, but it stops future scraping. A verification sketch follows this list.
  • Document your removal requests: Record dates, methods, and any responses. Courts have tied standing to demonstrable harm, and a paper trail is how you show it.
  • Evaluate the blocking tradeoff carefully: Research published in early 2026 found that publishers who blocked AI crawlers via robots.txt saw a 23.1% monthly visit decline. Blocking costs traffic. Not blocking costs content. There's no clean answer.
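
If you manage more than a handful of domains, it's worth verifying that the CCBot directive is actually live rather than assuming it. Here is a minimal sketch using Python's standard-library robots.txt parser; example.com is a placeholder for your own properties, and ccbot_blocked is an illustrative helper name, not an existing tool.

```python
# Verify that a site's live robots.txt disallows CCBot at the root.
# Minimal sketch using only the Python standard library.
from urllib.robotparser import RobotFileParser

def ccbot_blocked(domain: str) -> bool:
    """Return True if robots.txt denies CCBot access to the site root."""
    parser = RobotFileParser()
    parser.set_url(f"https://{domain}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return not parser.can_fetch("CCBot", f"https://{domain}/")

if __name__ == "__main__":
    for domain in ["example.com"]:  # replace with your domains
        status = "blocked" if ccbot_blocked(domain) else "NOT blocked"
        print(f"{domain}: CCBot {status}")
```

One caveat: robotparser treats a missing robots.txt (an HTTP 404) as allow-all, so a domain with no robots.txt at all will report CCBot as not blocked, which is the correct answer.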

If you want to assess where your crawler protection stands right now, our AI Crawler Protection Grader gives you a fast, technical read on your current configuration. The AI Crawler Resource Center has the implementation details if you want to go deeper.

The Traffic You Have Still Needs to Work

Publishers are losing ground on two fronts simultaneously. AI companies are consuming content without sending traffic back. Cloudflare data from August 2025 showed Anthropic crawling 38,000 pages for every single referred visit. Training-related crawling accounted for nearly 80% of all AI bot activity.

Blocking doesn't fully solve that, either. The 23.1% monthly visit decline among publishers who blocked AI crawlers means the traffic pool is shrinking regardless of which path you choose, and agency revenue data signals that the AI search traffic shift is already hitting publisher bottom lines.

That makes yield optimization on your remaining traffic more important, not less. Every session carries more weight when total sessions are under pressure, and revenue per session (RPS) matters more when volume is declining. If your monetization stack isn't pulling maximum value from the traffic you do have, you're compounding the problem. Publishers navigating this shift are optimizing for AI referrals as the new SEO while shoring up yield on existing sessions.

We work with publishers across gaming, news, entertainment, and education to make sure that doesn't happen. A conversation about what your current inventory is actually worth should happen now, not after another quarter of traffic erosion. Talk to us.
