Publishers are caught in an uncomfortable position. AI companies scrape your content to train their models, then those same AI systems compete directly with your site for user attention. It's like giving away your playbook and then wondering why you're losing the game.
The numbers paint a stark picture. According to Digital Content Next research, Google Search referrals to premium publishers dropped 10% year-over-year through mid-2025, with non-news brands experiencing 14% declines. AI Overviews now appear on a growing percentage of search queries, providing instant answers that eliminate the need for users to click through to source websites.
Yet blocking AI entirely isn't a simple solution. AI search engines are becoming significant referral sources, and being excluded from AI-generated responses could mean becoming invisible to an increasingly large segment of users. The question isn't whether to engage with AI crawlers. It's how to do so strategically.
AI crawlers are automated bots that scan websites to collect data for various purposes. Unlike traditional search engine crawlers that index content for discovery, many AI crawlers harvest content specifically for training large language models.
The distinction matters enormously for publishers. When Googlebot crawls your site, it helps users find your content through search results, driving traffic back to you. When GPTBot crawls your site, it feeds your content into models that may answer user questions directly, potentially eliminating the need to visit your site at all.
AI crawlers fall into several functional categories, each with different implications for your content strategy and revenue.
Need a Primer? Read these first:
- How to Block AI from Scraping Your Website: A technical deep-dive into implementation methods for blocking AI scrapers
- How to Block AI Bots with robots.txt: The foundational guide to using robots.txt for AI crawler management
Learning how to block AI scrapers protects your intellectual property from being used to train competing systems. For publishers who've invested heavily in original content creation, watching that content power AI responses that bypass your site feels like subsidizing your own competition.
Here's the uncomfortable truth: blocking AI crawlers may reduce your visibility in AI-powered search experiences. AI search engines like ChatGPT Search and Perplexity drive meaningful referral traffic, and that traffic is growing.
If you block AI training crawlers but allow search crawlers, you might maintain some AI visibility while protecting your content from model training. This selective approach requires knowing which bots do what, and that information isn't always clearly documented.
Allowing AI crawlers positions your content for citation in AI-generated responses. When AI systems reference your site as a source, you gain visibility and potentially traffic from users seeking deeper information.
Allowing access means your content trains models that may eventually compete with you. AI systems improve by learning from quality sources. Your excellent content makes AI responses better, potentially reducing user need to visit your site.
This creates a paradox: the better your content, the more valuable it is for AI training, and the more likely AI systems can satisfy users without sending them to you.
The robots.txt file remains the primary mechanism for communicating crawler preferences and is the first tool publishers should understand when learning how to block AI scrapers. This simple text file lives at your site's root directory and tells well-behaved bots which areas they can access.
To block AI crawlers, add User-agent directives followed by Disallow rules. Here's a template that blocks major AI training crawlers:
# Block OpenAI Training Crawler
User-agent: GPTBot
Disallow: /
# Block Anthropic Crawlers
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
# Block Google AI Training
User-agent: Google-Extended
Disallow: /
# Block Common Crawl
User-agent: CCBot
Disallow: /
# Block Meta AI
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Block Perplexity Training
User-agent: PerplexityBot
Disallow: /
# Block ByteDance
User-agent: Bytespider
Disallow: /
Many publishers prefer selective approaches that differentiate between training and search crawlers. This template blocks training bots while allowing search-related access:
# Allow OpenAI Search Bot
User-agent: OAI-SearchBot
Allow: /
# Allow ChatGPT User-Triggered Fetches
User-agent: ChatGPT-User
Allow: /
# Block OpenAI Training
User-agent: GPTBot
Disallow: /
# Allow Bing (which powers many AI searches)
User-agent: Bingbot
Allow: /
# Block AI Training Crawlers
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
Some publishers allow AI crawlers access to specific sections while protecting others:
# Allow GPTBot access to blog only
User-agent: GPTBot
Allow: /blog/
Disallow: /
# Block training crawlers from premium content
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Allow: /
Understanding which crawlers belong to which organizations helps you make informed blocking decisions. The following table covers the most significant AI crawlers publishers encounter when implementing strategies to block AI scrapers.
| User Agent | Operator | Purpose | Respects robots.txt | Notes |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Model training | Yes | Primary training crawler |
| OAI-SearchBot | OpenAI | Search indexing | Yes | Powers ChatGPT search |
| ChatGPT-User | OpenAI | Real-time fetching | Generally | User-triggered requests |
| ClaudeBot | Anthropic | Model training | Yes | Training data collection |
| anthropic-ai | Anthropic | Bulk training | Yes | Legacy crawler |
| Claude-Web | Anthropic | Web crawling | Unclear | Limited documentation |
| Google-Extended | Google | AI training | Yes | Gemini training data |
| Googlebot | Google | Search indexing | Yes | Do not block for SEO |
| Bingbot | Microsoft | Search indexing | Yes | Powers Copilot search |
| PerplexityBot | Perplexity | Search indexing | Sometimes | Mixed compliance reports |
| CCBot | Common Crawl | Dataset creation | Yes | Used by many AI systems |
| Bytespider | ByteDance | AI training | Sometimes | Powers Doubao/TikTok AI |
| FacebookBot | Meta | AI training | Yes | |
| Meta-ExternalAgent | Meta | AI features | Yes | Newer Meta crawler |
| Applebot-Extended | Apple | AI training | Yes | Apple Intelligence |
| DuckAssistBot | DuckDuckGo | AI answers | Yes | DuckDuckGo AI features |
| Amazonbot | Amazon | AI training | Yes | Alexa/AWS AI training |
The robots.txt file operates on the honor system. Crawlers can ignore your preferences entirely. For publishers wanting enforced blocking, several technical approaches provide stronger protection when learning how to block AI scrapers effectively.
Apache servers can block AI crawlers using .htaccess files. This approach returns error pages to blocked crawlers:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider) [NC]
RewriteRule .* - [F,L]
This configuration returns a 403 Forbidden response to matching user agents, preventing content access regardless of robots.txt compliance.
Nginx servers can implement similar blocking through configuration files:
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai|Bytespider|PerplexityBot)") {
return 403;
}
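Once a server-level rule is in place, it's worth confirming the block actually triggers. The Python sketch below sends a test request with a blocked crawler's user-agent string and reports the response code; the URL is a placeholder, and the user-agent values are simplified stand-ins for the longer strings real crawlers send (the substring patterns in the rules above still match them).
# Minimal sketch: confirm a server-level block by requesting a page with a
# blocked user-agent string and checking for a 403 response.
# The URL below is a placeholder; point it at your own site.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def status_for_agent(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server sends for this user agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as error:
        return error.code

print(status_for_agent("https://www.example.com/", "GPTBot/1.1"))   # expect 403 if the block works
print(status_for_agent("https://www.example.com/", "Mozilla/5.0"))  # expect 200 for normal browsers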
Cloudflare offers managed AI crawler blocking through their dashboard. This approach provides several advantages over manual configuration.
According to Cloudflare's July 2025 announcement, new websites on their platform now have AI crawlers blocked by default, representing a significant shift in how the web handles AI scraping.
Some AI companies publish IP ranges for their crawlers, enabling firewall-level blocking. OpenAI and a few others provide verifiable IP information, though many operators don't.
The challenge with IP blocking: legitimate crawlers may share infrastructure with other services, and blocking entire IP ranges could cause unintended consequences.
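As a rough illustration of IP-based checks, the sketch below tests a client address against a list of CIDR ranges using Python's standard ipaddress module. The ranges shown are documentation placeholders rather than any operator's real blocks; substitute the CIDRs the crawler operator actually publishes.
# Minimal sketch: check whether a request IP falls inside published crawler
# CIDR ranges. The ranges below are placeholders from the documentation
# address space, not real crawler ranges.
import ipaddress

PUBLISHED_RANGES = [
    "192.0.2.0/24",      # placeholder
    "198.51.100.0/24",   # placeholder
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES]

def ip_in_published_ranges(client_ip: str) -> bool:
    """Return True if client_ip belongs to one of the published ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in NETWORKS)

print(ip_in_published_ranges("198.51.100.25"))  # True with the placeholder ranges
print(ip_in_published_ranges("203.0.113.7"))    # False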
Related Content:
- Selective AI Blocking: Strategies for allowing beneficial bots while blocking training crawlers
- The Complete List of AI Crawlers: Reference guide to all major AI crawlers and their purposes
- Using Cloudflare to Block AI Crawlers: Step-by-step setup guide for Cloudflare's AI Crawl Control
- How AI Crawling Affects Your Ad Revenue: Data-driven analysis of AI crawler impact on publisher monetization
- The Real Cost of Blocking AI: Traffic and revenue implications of AI blocking decisions
The legal framework surrounding AI crawling remains unsettled. Publishers should understand the current landscape while recognizing it continues to evolve.
Using copyrighted content to train AI models without permission exists in a legal gray area. Several high-profile lawsuits argue that AI training constitutes copyright infringement, while AI companies counter that training represents fair use.
The robots.txt file isn't a legal document. It expresses preferences but doesn't create enforceable rights. Violating robots.txt directives may constitute trespass or breach of computer access laws in some jurisdictions, but case law remains limited.
Many publishers include terms of service provisions prohibiting automated scraping. These terms may provide stronger legal footing than robots.txt alone, though enforcement remains challenging.
The EU AI Act and similar regulations may eventually address AI training data rights. Publishers should monitor regulatory developments that could create clearer frameworks for controlling content use.
The decision to block AI affects traffic patterns and monetization potential. Understanding these implications helps publishers make informed choices aligned with their business models.
Traditional search referrals continue declining across the publishing industry. According to Digiday's analysis of DCN member data, publishers experienced year-over-year traffic declines of between 1% and 25%, with a median decline of 10%. This decline accelerated with the rollout of AI Overviews and similar features that answer queries directly in search results.
AI search engines drive meaningful but still modest referral traffic compared to traditional search. According to Cloudflare's analysis, Anthropic's Claude made nearly 71,000 crawl requests for every single referral back to publisher sites. The trajectory shows some growth in AI referrals, making this an emerging channel worth monitoring.
Publishers cited in AI responses generally see some referral traffic from users seeking more information. However, citation doesn't guarantee clicks. Many users accept AI summaries without visiting sources.
Traffic volume directly impacts ad revenue. Blocking AI crawlers that might cite your content could reduce referral traffic, though the relationship isn't straightforward. Consider these factors when evaluating revenue impact:
There's no universal right answer for managing AI crawlers. The optimal approach depends on your content type, business model, and strategic priorities.
Start by answering these questions:
Once you've determined your strategy, follow these implementation steps:
Implementing a crawler strategy requires ongoing monitoring to understand its effects and make adjustments.
Several tools help publishers understand AI crawler activity:
Publishers who choose to allow AI crawlers can optimize their content for better AI visibility and citation. This approach treats AI systems as another discovery channel worth cultivating.
AI systems favor well-structured content with clear, extractable facts. Format content with AI parsing in mind:
Structured data helps AI systems understand your content's meaning and context. Implementing appropriate schema markup may improve how AI systems interpret and cite your content.
You can optimize for AI visibility while still maintaining boundaries. Allow access to content you want cited while protecting premium or sensitive material. This hybrid approach maximizes exposure where beneficial while maintaining control where necessary.
The landscape of AI crawler management continues evolving rapidly. Publishers should stay informed about emerging tools and standards that may affect their strategies.
Cloudflare introduced a Content Signals Policy framework that allows publishers to express preferences beyond simple allow/block directives. This emerging standard lets publishers specify:
This framework moves beyond the binary allow/block model toward more nuanced content governance. As adoption grows, publishers gain finer control over how their content is used.
Some infrastructure providers now offer pay-per-crawl options, allowing publishers to monetize AI crawler access directly. Cloudflare's Pay Per Crawl system, announced in July 2025, enables publishers to set prices for AI bot access rather than blocking entirely.
This model treats content as a licensable asset. AI companies wanting access pay for the privilege, creating a revenue stream separate from advertising. Whether these fees meaningfully compensate publishers for content use remains to be seen, but the model represents a creative approach to the AI content dilemma.
The problem of crawler verification grows increasingly important. Any bot can claim to be GPTBot by setting an appropriate user-agent string. Verifying actual identity requires additional steps.
Approaches to crawler verification include checking requests against the IP ranges operators publish and confirming identity through reverse DNS lookups, as sketched below.
As spoofing becomes more common, verification becomes more important. Publishers relying solely on user-agent blocking may find their content accessed by bots claiming legitimate identities.
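One widely used pattern is a reverse DNS lookup followed by a forward confirmation: resolve the requesting IP to a hostname, check that the hostname belongs to the operator's domain, then resolve that hostname back and confirm it returns the same IP. Google documents this method for verifying Googlebot; whether a given AI crawler supports it depends on the operator. A minimal sketch in Python:
# Minimal sketch of reverse-DNS verification: confirm that a request claiming
# to be a known crawler actually comes from that operator's infrastructure.
# Support varies by operator, so check each crawler's documentation.
import socket

def verify_crawler_ip(client_ip: str, expected_suffixes: tuple[str, ...]) -> bool:
    try:
        # Step 1: reverse lookup -- what hostname does this IP map to?
        hostname, _, _ = socket.gethostbyaddr(client_ip)
        if not hostname.endswith(expected_suffixes):
            return False
        # Step 2: forward confirmation -- does that hostname resolve back to the same IP?
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        return client_ip in forward_ips
    except (socket.herror, socket.gaierror):
        # No reverse record or failed forward lookup: treat as unverified
        return False

# Example: verify a request claiming to be Googlebot
# verify_crawler_ip("66.249.66.1", (".googlebot.com", ".google.com"))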
Publishers implementing AI crawler strategies often encounter pitfalls that undermine their goals. Avoiding these common mistakes improves effectiveness.
Some publishers attempt to block AI by blocking all bots, inadvertently preventing search engine indexing. Googlebot and Bingbot should generally remain unblocked for SEO purposes.
Review your robots.txt carefully to ensure you're blocking specific AI crawlers rather than all crawlers. The distinction matters enormously for search visibility.
Robots.txt on your main domain doesn't affect subdomains. If you run content on multiple subdomains, each needs its own robots.txt file with appropriate directives.
Similarly, different environments (staging, development) may have different robots.txt files. Ensure consistency across all production environments where you want blocking enforced.
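A quick consistency check is to fetch robots.txt from every production host and flag any that lack rules for the crawlers you intend to block. A minimal sketch, with placeholder hostnames and an illustrative crawler list:
# Minimal sketch: confirm each production host serves a robots.txt that
# mentions the crawlers you intend to block. Hostnames are placeholders.
from urllib.request import urlopen

HOSTS = ["www.example.com", "blog.example.com", "forums.example.com"]  # placeholders
BLOCKED_AGENTS = ["GPTBot", "CCBot", "ClaudeBot"]

for host in HOSTS:
    try:
        body = urlopen(f"https://{host}/robots.txt", timeout=10).read().decode("utf-8", "replace")
    except OSError as exc:
        print(f"{host}: could not fetch robots.txt ({exc})")
        continue
    missing = [agent for agent in BLOCKED_AGENTS if agent.lower() not in body.lower()]
    if missing:
        print(f"{host}: no rules found for {', '.join(missing)}")
    else:
        print(f"{host}: all expected crawler rules present")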
After implementing robots.txt changes, test that they work as intended. Google Search Console's robots.txt report shows how Google fetches and parses your file, and third-party tools can verify how various crawlers interpret your directives.
Testing catches syntax errors and logical problems before they cause unintended consequences. A misplaced wildcard or typo can block more traffic than intended or fail to block what you wanted to exclude.
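Alongside Search Console, Python's built-in robots.txt parser gives a fast local read on how specific user agents would be treated by your current file. A minimal sketch against a placeholder URL:
# Minimal sketch: use the standard-library robots.txt parser to check how
# different user agents would be treated. URLs are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

for agent in ["GPTBot", "CCBot", "Googlebot"]:
    allowed = parser.can_fetch(agent, "https://www.example.com/blog/some-article/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")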
Your server logs reveal which crawlers actually access your content. Review logs regularly to verify your blocking works and identify crawlers you may have missed.
Some AI crawlers use undocumented user agents or change their identification over time. Log analysis helps you stay current with actual crawler activity.
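As a starting point, a short script can tally requests from known AI user agents in a combined-format access log. A minimal sketch; the log path and the agent list are assumptions to adjust for your own setup:
# Minimal sketch: tally AI crawler hits in a combined-format access log.
# The log path and user-agent list are assumptions; adjust to your setup.
import re
from collections import Counter

AI_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "anthropic-ai", "Bytespider",
             "PerplexityBot", "Google-Extended", "Meta-ExternalAgent"]

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # Combined log format: the user agent is the last quoted field
        match = re.search(r'"([^"]*)"\s*$', line)
        if not match:
            continue
        user_agent = match.group(1)
        for agent in AI_AGENTS:
            if agent.lower() in user_agent.lower():
                counts[agent] += 1

for agent, hits in counts.most_common():
    print(f"{agent}: {hits} requests")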
Some publishers block all AI crawlers preemptively, then wonder why they don't appear in AI search results. If you want AI visibility, you need to allow relevant crawlers access.
Start with a balanced approach that allows search-related crawlers while blocking training crawlers. Adjust based on observed outcomes rather than blocking everything by default.
Next Steps:
- Should You Block AI Crawlers?: Use our decision framework to determine the right strategy for your site
- How to Get AI Tools to Cite Your Website: An alternative approach that optimizes for AI visibility instead of blocking
- AI Traffic is the New SEO: Emerging strategies for optimizing content for AI-powered discovery
Publishers commonly ask similar questions when developing their AI crawler strategies. Here are answers to the most frequent inquiries about how to block AI scrapers effectively.
Blocking AI-specific crawlers like GPTBot doesn't directly affect traditional search rankings. Google's main search crawler (Googlebot) is separate from Google-Extended, which feeds AI training. You can block Google-Extended while allowing Googlebot without SEO penalty.
However, blocking AI crawlers may reduce visibility in AI-powered search experiences. As these experiences grow in popularity, reduced AI visibility could indirectly affect your overall discoverability.
No. While major AI companies like OpenAI and Anthropic claim their crawlers respect robots.txt, compliance isn't universal. Some crawlers have been documented ignoring robots.txt directives entirely.
For enforced blocking, server-level or firewall-level controls provide stronger protection than robots.txt alone.
Review your robots.txt quarterly at minimum. The AI landscape evolves rapidly, with new crawlers appearing regularly and existing crawlers changing behavior.
Major events that should trigger robots.txt review include: new AI product launches, changes to AI company crawler policies, significant changes to your content strategy, and any observed unusual bot behavior in your logs.
Yes. You can configure robots.txt with different rules for different user agents. This selective approach allows you to, for example, permit OpenAI's search crawler while blocking their training crawler.
The key is understanding which user agents correspond to which functions. Use the reference table earlier in this guide to identify which crawlers serve which purposes.
Blocking AI crawlers prevents future access but doesn't remove content already collected. AI companies don't typically offer mechanisms for requesting removal of previously scraped content from training datasets.
This reality makes early blocking decisions particularly important. Content crawled before you implement blocking remains in AI training pipelines.
This depends on your content and strategy. Some publishers block AI from premium content while allowing access to freely available material. Others block AI from their entire site.
Consider: What content represents your core value? What content might benefit from AI visibility? Answering these questions helps determine appropriate blocking scope.
Check your server access logs for user-agent strings matching known AI crawlers. Look for GPTBot, ClaudeBot, CCBot, and similar identifiers.
If you use Cloudflare or similar services, their dashboards may show AI crawler activity more accessibly than raw log analysis.
Not necessarily. The relationship between AI crawler access and traffic is complex and delayed. Blocking training crawlers may not affect traffic for months or longer, as it takes time for trained models to be deployed.
Blocking search-related crawlers may have faster effects on AI referral traffic, though this channel remains relatively small for most publishers currently.
The current tension between publishers and AI companies seems unsustainable. Content creators provide the raw material that makes AI systems valuable, yet receive little compensation while facing competition from those same systems.
Several developments may reshape this relationship over time.
Major publishers including The New York Times, News Corp, and others have signed licensing deals with AI companies. These agreements provide compensation for content use while granting AI companies legitimate access.
Smaller publishers may benefit as these deals establish market rates for content licensing. However, individual negotiations remain impractical for most publishers.
The EU AI Act and similar regulations may eventually require AI companies to compensate content creators or obtain explicit permission for training. U.S. copyright litigation could establish clearer legal frameworks for AI training rights.
Regulatory clarity would benefit publishers by establishing enforceable rules rather than relying on voluntary compliance with robots.txt preferences.
Publisher coalitions may develop collective bargaining power to negotiate with AI companies. Organizations like Digital Content Next have advocated for publisher interests in AI policy discussions.
Collective action could establish industry-wide standards for AI content use, reducing the burden on individual publishers to navigate these issues alone.
As blocking becomes more sophisticated, AI companies may develop more sophisticated scraping techniques. This arms race could drive development of more robust content protection mechanisms.
Alternatively, AI companies may shift toward properly licensed content sources as blocking becomes more prevalent and legally risky. Economic pressure may accomplish what ethical appeals haven't.
Whatever your AI crawler strategy, maximizing revenue from the traffic you do receive remains the fundamental goal. Whether traffic comes from traditional search, AI referrals, or other sources, effective monetization turns visitors into revenue.
Strong yield optimization ensures every visitor generates maximum value. This becomes increasingly important as traffic sources diversify and traditional channels face pressure.
Playwire's RAMP Platform provides publishers with the tools and expertise to maximize revenue from every traffic source. Our Revenue Intelligence® technology optimizes yield in real-time, while our team of yield operations experts provides strategic guidance tailored to your specific situation.
Publishers working with Playwire gain access to premium demand sources, advanced analytics that show exactly how content drives revenue, and ongoing optimization that adapts to changing market conditions. Whether you're navigating AI crawler decisions or optimizing your overall monetization strategy, having the right technology partner makes the difference between leaving money on the table and maximizing your revenue potential.
Ready to ensure your traffic generates maximum revenue? Apply to work with Playwire and see how our platform can amplify your ad revenue while you focus on creating great content.
The AI crawler landscape will continue evolving, but one thing remains constant: publishers who maximize the value of every visitor position themselves for success regardless of how discovery channels shift. Your content is valuable. Make sure your monetization strategy captures that value.