How to Block AI Scrapers: The Legal Landscape Publishers Need to Know in 2025
December 8, 2025
Key Points
- Over 50 AI-related copyright lawsuits have been filed against major tech companies, with publishers like The New York Times, Ziff Davis, and Condé Nast leading the charge against OpenAI, Anthropic, and others.
- Publishers can block AI scrapers through robots.txt modifications, Cloudflare's AI blocker, and other technical measures, though effectiveness varies by bot compliance.
- Licensing deals with AI companies are creating a new revenue stream, with agreements ranging from $1 million to over $250 million annually for major publishers.
- The EU AI Act now requires transparency about training data, giving publishers new tools to understand how their content is used.
- Traffic declines from AI Overviews and zero-click searches are accelerating, making protection of remaining organic traffic essential for ad revenue sustainability.
The Great Content Grab is Under Legal Fire
Publishers spent decades building audiences, creating quality content, and perfecting their craft. Then generative AI showed up and decided to help itself to the buffet without paying the cover charge.
The situation has prompted an unprecedented wave of legal action. Publishers are no longer willing to watch their content feed AI models that may ultimately reduce traffic to their sites. The courts are now the battleground, and publishers are showing up with lawyers.
The New York Times fired the opening salvo in late 2023, suing OpenAI and Microsoft for allegedly using its journalism to train ChatGPT. Since then, the floodgates have opened. Ziff Davis, publisher of IGN and Mashable, filed suit alleging OpenAI continued scraping their sites even after they implemented blocking measures. In 2025, a coalition including Condé Nast, The Atlantic, Forbes, and Vox Media filed against Canadian AI startup Cohere for using their articles to train large language models.
Reddit has joined the fight as well, suing both Anthropic and Perplexity for allegedly scraping millions of user posts without authorization. The Reddit case against Perplexity specifically names third-party entities that allegedly helped circumvent platform protections. For publishers weighing their options, our complete guide to AI crawlers covers whether to block, allow, or optimize your approach for maximum revenue.
Need a Primer? Read this first:
- The Real Cost of Blocking AI: Traffic and Revenue Impact Analysis: Understand the revenue implications before deciding on your AI blocking strategy
- AI Crawlers: Block, Allow, or Optimize?: A complete guide to evaluating your approach to AI crawlers for maximum revenue
Where the AI Copyright Lawsuits Stand Right Now
The legal landscape is complex and evolving rapidly. Understanding which cases are active helps publishers gauge the direction of AI copyright law and plan their content protection strategies accordingly.
Here is a snapshot of the current state of major AI copyright litigation:
| Case | Status | Key Issue |
|------|--------|-----------|
| NY Times v. OpenAI/Microsoft | Active, discovery phase | Direct copyright infringement for training data |
| Publishers v. Cohere | Survived dismissal motion | 14 publishers allege content theft for LLM training |
| Ziff Davis v. OpenAI | Consolidated in MDL | Robots.txt allegedly ignored, continued scraping |
| Reddit v. Anthropic | Active | Claimed scraping of 100K+ posts without permission |
| Reddit v. Perplexity | Active | Industrial-scale scraping allegations |
| Thomson Reuters v. Ross Intelligence | Fair use defense rejected | Precedent-setting ruling against AI training |
The Thomson Reuters case deserves special attention. A federal judge rejected the fair use defense in early 2025, marking a significant warning shot for AI companies claiming that scraping content falls under fair use protections.
Most cases remain in discovery or pre-trial phases. Recent developments include OpenAI being ordered to hand over 20 million ChatGPT chat logs to The New York Times as part of discovery. In a separate proceeding, OpenAI must disclose internal communications about the deletion of datasets allegedly sourced from pirated books, with plaintiffs arguing this could demonstrate willful infringement.
Why Fair Use Claims Are Falling Flat
AI companies have largely hung their hats on fair use defenses. They argue that training models on copyrighted content is transformative and therefore legal. Publishers counter that scraping their work for commercial gain, then using it to compete with them, is anything but fair.
The legal arguments break down into several key areas that directly impact how publishers should approach content protection:
- Transformative use: AI companies claim their models create new outputs, not copies. Publishers argue that ingesting and regurgitating content is not transformation.
- Commercial nature: Training models that generate billions in revenue is clearly commercial. This weighs against fair use claims.
- Market harm: Publishers point to declining traffic and the cannibalization of their content as direct market damage. Understanding the scope of this impact is critical; our traffic and revenue impact analysis breaks down the real cost of blocking AI versus allowing access.
- Amount used: Many AI models trained on entire websites or databases, not small excerpts. Wholesale copying rarely qualifies as fair use.
Recent court decisions suggest judges are not buying the fair use argument wholesale. The Thomson Reuters ruling specifically rejected this defense in an AI training context. Anthropic recently agreed to pay $1.5 billion to settle a lawsuit related to using pirated books for training, after a judge ruled that buying a copy of a book after previously obtaining it illegally would not absolve the company of liability.
Related Content:
- How to Block AI from Scraping Your Website: Step-by-step technical implementation instructions for blocking AI crawlers
- The Complete List of AI Crawlers: Every AI bot you should consider blocking and what each one is used for
- Google's Recent Algorithm Updates: How Google's changes factor into the evolving traffic landscape for publishers
How to Block AI Scrapers from Your Site
You do not have to wait for courts to protect your content. Technical measures can help you take control now. Most major AI companies have documented ways for publishers to opt out of training data collection, and implementing an AI blocker strategy should be a priority. For step-by-step implementation instructions, our technical guide walks through exactly how to block AI from scraping your website.
The robots.txt file remains your first line of defense. This simple text file tells crawlers which parts of your site they can access. Adding specific rules blocks AI bots from indexing your content.
Here are the primary AI user agents you can block:
- GPTBot: OpenAI's web crawler for training data
- ClaudeBot: Anthropic's crawler for Claude AI models
- Google-Extended: Controls use for Gemini and Vertex AI training (separate from search indexing)
- CCBot: Common Crawl's open repository crawler
- Bytespider: ByteDance's AI data scraper
- FacebookBot: Meta's crawler for AI language models
- PerplexityBot: Perplexity AI's search crawler
- Applebot-Extended: Apple's crawler for AI and LLM training
For a comprehensive breakdown of every bot you should consider blocking, our complete list of AI crawlers details how to block each one and what they're used for.
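Before deciding which bots to block, it helps to know how often they actually hit your site. One way is to scan your web server's access logs for the user-agent strings above. A minimal sketch; the log lines below are fabricated samples in combined log format, and in practice you would read lines from your real access log:

```python
from collections import Counter

# User-agent substrings for the AI crawlers listed above.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
           "Bytespider", "FacebookBot", "PerplexityBot", "Applebot-Extended"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler by matching user-agent substrings."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

# Fabricated sample lines for illustration.
sample = [
    '1.2.3.4 - - [08/Dec/2025:10:00:00 +0000] "GET /article HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [08/Dec/2025:10:01:00 +0000] "GET /news HTTP/1.1" 200 204 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [08/Dec/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (regular browser)"',
]

hits = count_ai_bot_hits(sample)
print(dict(hits))  # {'GPTBot': 1, 'ClaudeBot': 1}
```

Running this over a week of logs gives you a rough baseline for how much AI crawler traffic you are dealing with before and after you implement blocks.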
The implementation is straightforward. Add disallow rules to your robots.txt file for each user agent you want to block. Major publishers including The New York Times, Wall Street Journal, Reuters, and Vox have already implemented these blocks.
Sample Robots.txt Configuration to Block AI Scrapers
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /
```
However, robots.txt compliance is voluntary. Well-established companies like OpenAI and Google generally follow these protocols. Smaller or less scrupulous bots may ignore them entirely.
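You can sanity-check your rules before deploying them with Python's standard-library `urllib.robotparser`, which evaluates a robots.txt file the way a compliant crawler would. A minimal sketch using the rules above (the example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The first two entries from the sample configuration above.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant GPTBot must not fetch any page under these rules...
print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
# ...while crawlers not named in the file are unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
```

This only tells you what a rule-following bot would do; it says nothing about bots that ignore robots.txt, which is where server-level tools come in.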
The Cloudflare AI Blocker Solution
Cloudflare now offers a one-click AI blocker feature that provides additional protection for sites on their platform. Navigate to Security > Bots in the Cloudflare dashboard and enable the toggle labeled "AI Scrapers and Crawlers." This feature is available on all plans, including the free tier, and will automatically update as Cloudflare identifies new offending bots.
According to Cloudflare's analysis, AI bots accessed around 39% of the top one million Internet properties in June 2024, but only 2.98% of these properties took measures to block or challenge those requests. The gap represents a significant opportunity for publishers to protect their content.
The Licensing Alternative: Getting Paid for Your Content
While some publishers are fighting in court, others are making deals. AI content licensing has emerged as a new revenue stream, with major agreements now totaling billions of dollars across the industry.
OpenAI has been the most aggressive dealmaker, signing agreements with publishers including News Corp, The Atlantic, Vox Media, Dotdash Meredith, Condé Nast, Hearst, and others. Google is now entering the fray, recruiting approximately 20 national news outlets for a pilot licensing project. Meta recently announced deals with USA Today, Fox News, and Le Monde.
Deal structures vary widely, and understanding the landscape helps publishers evaluate their options:
| Publisher/Agreement | Reported Value | Structure |
|---------------------|----------------|-----------|
| News Corp / OpenAI | ~$250M over 5 years | Cash + technology credits |
| Reuters / Various AI | ~$65M+ | Upfront + recurring payments |
| Reddit / Google | ~$60M annually | Content licensing |
| Dotdash Meredith / OpenAI | ~$16M annually | Ongoing license |
| Axel Springer / OpenAI | ~$25M | Upfront + variable fees |
Most deals include two components: a fixed upfront payment and variable payouts based on ongoing usage. Some publishers have secured "most favored nation" clauses that allow renegotiation if competitors get better terms. Industry analysis suggests the average deal size across 34 tracked agreements is approximately $24 million, with total commitments approaching $3 billion.
The question facing publishers is whether licensing revenue will offset traffic losses over time. Critics argue that accepting payment today may legitimize content use while AI products continue eroding organic traffic.
The EU AI Act Changes the Game for Publisher Rights
European regulators have moved faster than their American counterparts. The EU AI Act, which became effective for general-purpose AI models in August 2025, introduces significant new requirements that affect how AI companies use publisher content.
Key provisions for publishers include:
- Training data transparency: AI providers must publish summaries of content used for training, including information about copyrighted materials. The European Commission published a mandatory template in July 2025 specifying exactly what information AI providers must disclose.
- Opt-out respect: The Act references the Copyright Directive's text and data mining exception, requiring AI companies to respect publisher reservations of rights.
- Enforcement teeth: Penalties can reach 3% of worldwide annual turnover or €15 million, whichever is higher.
Publishers in the EU now have regulatory backing for their rights. AI companies placing models on the EU market must document how they handled copyright compliance throughout development. This transparency requirement could provide evidence useful in litigation across jurisdictions.
The Code of Practice implementing the AI Act has faced criticism from creative industry groups who argue it does not go far enough. The battle over implementation details continues, but the framework represents a meaningful step toward publisher protection.
Why Traffic Decline Makes AI Blocking Essential
The stakes for publishers extend beyond principle. Traffic is money, and AI features are actively eroding organic search visits, which is why implementing an AI blocker strategy matters for ad revenue. Publishers and content creators also need to understand how Google's recent algorithm updates factor into this changing landscape.
Research shows that AI Overviews appearing in Google search results reduce click-through rates significantly. Studies throughout 2024 and 2025 documented reductions ranging from 34% to 47% when AI summaries appear. Zero-click searches have grown from 56% to nearly 69% for news-related queries.
The impact varies by content category:
| Content Type | Traffic Impact |
|--------------|----------------|
| Educational content | Up to 49% decline (Chegg reported data) |
| Lifestyle publishers | CTR dropped from 5.1% to 0.6% in some cases |
| Travel/Tourism | 20% YoY decline in search referrals |
| News/Media | 17% YoY decline in search referrals |
| Non-news brands | 14% median decline |
Digital Content Next, representing approximately 40 premium publishers, found that median year-over-year referral traffic from Google Search declined 10% overall across an eight-week study period in 2025, with news brands down 7% and non-news brands down 14%.
These traffic declines translate directly to reduced ad revenue. Every visitor who gets their answer from an AI Overview is a visitor who never sees your ads, never generates an impression, and never contributes to your bottom line.
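The arithmetic behind that loss is easy to model. A back-of-the-envelope sketch with illustrative numbers; the session counts, pageviews per session, and RPM below are assumptions for demonstration, not benchmarks:

```python
def monthly_ad_revenue(sessions, pages_per_session, rpm):
    """Revenue = total pageviews / 1000 * RPM (revenue per thousand pageviews)."""
    return sessions * pages_per_session / 1000 * rpm

# Hypothetical mid-size publisher: 1M monthly sessions, $15 RPM.
baseline = monthly_ad_revenue(sessions=1_000_000, pages_per_session=2.0, rpm=15.0)

# A 34% drop in organic sessions (the low end of the AI Overviews CTR studies above).
after_ai = monthly_ad_revenue(sessions=660_000, pages_per_session=2.0, rpm=15.0)

print(f"Baseline: ${baseline:,.0f}/mo")             # Baseline: $30,000/mo
print(f"After:    ${after_ai:,.0f}/mo")             # After:    $19,800/mo
print(f"Lost:     ${baseline - after_ai:,.0f}/mo")  # Lost:     $10,200/mo
```

Plugging in your own traffic and RPM figures gives a quick estimate of your exposure to AI-driven traffic decline.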
What Publishers Should Do Right Now
The legal and technical landscape is shifting rapidly. Waiting for clarity may cost you traffic and revenue in the meantime. Publishers need a multi-pronged approach to protect their content and revenue.
- Implement AI blocking now: Update your robots.txt to block major AI crawlers. If you use Cloudflare, enable their AI blocker feature. This takes minutes and costs nothing.
- Document your position: Keep records of when you implemented blocking measures. This evidence could be important in any future licensing negotiations or legal proceedings.
- Monitor your traffic: Track organic search referrals separately from other sources. Watch for declines that correlate with AI feature rollouts. Understanding your exposure helps you plan.
- Evaluate licensing opportunities: AI companies are actively seeking deals. Know your content's value and be prepared to negotiate from a position of understanding.
- Consider the citation alternative: Rather than blocking entirely, some publishers are optimizing their content to earn citations from AI tools. Our guide explains how to get AI tools to cite your website as an alternative to blocking.
- Maximize remaining traffic value: The visitors you do receive are increasingly valuable. Strong ad monetization becomes critical when volume declines. Understanding the fundamentals of programmatic advertising helps publishers and advertisers navigate this evolving landscape.
- Stay informed: The legal landscape is evolving monthly. Join industry associations like Digital Content Next or the News/Media Alliance that track these issues and advocate for publisher interests.
Frequently Asked Questions About Blocking AI Scrapers
Does blocking AI scrapers affect my Google search rankings?
Blocking AI crawlers like GPTBot or ClaudeBot does not affect your Google Search rankings. These are separate from Googlebot, which indexes content for search results. The Google-Extended user agent specifically controls AI training data collection without impacting search visibility.
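For example, a robots.txt that opts out of Gemini and Vertex AI training while leaving search indexing untouched needs only a Google-Extended rule. Because Googlebot is not listed, search crawling continues as normal:

```
User-agent: Google-Extended
Disallow: /
```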
How effective is robots.txt at blocking AI scrapers?
Robots.txt is effective against reputable AI companies like OpenAI, Google, and Anthropic that respect the protocol. However, it relies on voluntary compliance. Smaller or less scrupulous bots may ignore robots.txt entirely, which is why combining it with tools like Cloudflare's AI blocker provides stronger protection.
Can I block AI scrapers and still license my content?
Yes. Many publishers implement blocking as a default while negotiating licensing agreements. Blocking establishes your position that content access requires permission, which can strengthen your negotiating leverage. Once a deal is signed, you can selectively allow access to licensed partners.
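Because robots.txt rules are set per user agent, selective access is straightforward: grant your licensed partner's bot explicit access while keeping the blanket blocks for everyone else. A sketch, assuming hypothetically that OpenAI is the licensed partner:

```
# Licensed partner: allowed
User-agent: GPTBot
Allow: /

# All other AI crawlers remain blocked
User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```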
What happens to content AI already scraped before I blocked them?
Blocking AI scrapers prevents future collection but does not remove content already ingested. This is a key point in ongoing litigation. Publishers suing AI companies are seeking damages for past unauthorized use, and some are demanding that AI models trained on their content be destroyed.
Next Steps:
- How to Get AI Tools to Cite Your Website: Consider the citation alternative to blocking for content visibility in AI tools
- Demystifying Ad Curation: Maximize value from every impression as traffic patterns shift
Protecting Your Revenue in the AI Era
The legal battles will play out over years. Regulatory frameworks will continue evolving. Technical measures will improve. What publishers can control right now is how well they monetize the traffic they still receive.
When organic traffic declines, every session matters more. Publishers need to ensure their ad monetization strategy captures maximum value from each visitor. This means optimizing ad layouts, accessing premium demand sources, and implementing sophisticated yield management. Understanding ad curation has become essential for publishers looking to maximize value in 2025.
Playwire helps publishers maximize ad revenue from their existing traffic. Our RAMP Platform combines AI-driven yield optimization with access to premium direct demand, ensuring you capture the full value of every impression. While you cannot control what AI companies do with their models, you can control how effectively you monetize the audience you have built.
Apply now to see how Playwire can help protect and grow your ad revenue in an increasingly uncertain traffic environment.