
Selective AI Blocking: How to Allow Beneficial Bots While Blocking Others

December 8, 2025




Key Points

  • AI crawlers fall into two distinct categories: training bots that scrape content for model development and search bots that can drive referral traffic back to your site.
  • A blanket block of all AI bots may protect your content from unauthorized training, but it also cuts off emerging traffic sources from AI search platforms.
  • Implementing a selective AI blocker filter strategy lets you maintain visibility in AI search results while blocking pure training crawlers.
  • Technical implementation requires understanding specific user agents and combining robots.txt directives with server-level controls for maximum effectiveness.
  • Monitoring and adjusting your block AI bots strategy is essential as the landscape evolves and new crawlers emerge regularly.

The All-or-Nothing Approach Is Leaving Money on the Table

Publishers face a genuine dilemma in the AI era. Block everything and you might protect your content from being absorbed into training datasets. Allow everything and your intellectual property feeds someone else's profit machine with nothing coming back. Neither extreme serves your revenue goals.

The good news? You don't have to choose between complete lockdown and open season. A nuanced approach exists that lets you welcome the bots bringing value while showing the door to those that only take. For publishers weighing this decision, understanding the real cost of blocking AI and its impact on traffic and revenue provides essential context before implementing any strategy.


Understanding the AI Bot Ecosystem

Before implementing any AI blocker filter strategy, you need to understand what you're actually dealing with. AI bots aren't a monolithic category. They serve fundamentally different purposes, and lumping them together is like treating all vehicles the same whether they're delivery trucks or getaway cars.

Training Crawlers: The Content Vacuum

Training crawlers exist to gather massive amounts of web content for developing and refining large language models. These bots scrape text, images, and structured data to build the datasets that power AI systems.

The key characteristic of training crawlers? They take without giving back. Your content goes in, and nothing comes out the other end toward your site. No traffic, no attribution, no compensation. Cloudflare data shows training now drives nearly 80% of AI bot activity, up from 72% a year ago. Knowing exactly which bots fall into this category is critical — our complete list of AI crawlers and how to block each one breaks down the full landscape.

| Bot Name | Operator | Primary Purpose | Robots.txt Compliance |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training data collection | Generally respects |
| ClaudeBot | Anthropic | Training data for Claude models | Generally respects |
| CCBot | Common Crawl | Dataset building for AI research | Generally respects |
| Google-Extended | Google | Gemini model training | Generally respects |
| Bytespider | ByteDance | Training data for Doubao/TikTok AI | Inconsistent |

Search and Retrieval Bots: The Traffic Drivers

Search and retrieval bots serve a different purpose entirely. These crawlers index content to power AI search engines and real-time answer retrieval. When users ask questions through AI platforms, these bots help surface your content and can drive referral traffic back to your site.

| Bot Name | Operator | Primary Purpose | Traffic Potential |
| --- | --- | --- | --- |
| OAI-SearchBot | OpenAI | ChatGPT Search indexing | Medium to High |
| ChatGPT-User | OpenAI | Real-time user query retrieval | High |
| PerplexityBot | Perplexity AI | Answer engine indexing | Medium |
| Perplexity-User | Perplexity AI | User-triggered content fetch | High |

The distinction matters enormously for publishers. Blocking GPTBot might make sense for protecting your training data rights. Blocking OAI-SearchBot cuts you off from appearing in ChatGPT search results entirely.


The Revenue Case for Selective Blocking

Publishers who rely on ad revenue have a particularly compelling reason to think strategically about which bots to block. Your business model depends on traffic volume. More visitors mean more ad impressions, which translates directly to revenue. Understanding what you need to know about ad blocking rate helps contextualize why protecting traffic sources matters so much for your bottom line.

The Crawl-to-Referral Imbalance

Not all AI platforms treat publishers equally. According to Cloudflare data, the crawl-to-referral ratios vary dramatically across platforms. In July 2025, Anthropic crawled 38,000 pages for every visitor referred back to publishers, while OpenAI maintained a ratio of 1,091 crawls per referral. Perplexity's ratio sits at 194 crawls per visitor.

These numbers reveal which platforms are taking the most while giving back the least. This imbalance should inform your blocking strategy.

Traffic Sources You Cannot Afford to Ignore

AI search is becoming a meaningful traffic channel. Similarweb data shows ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from January. Blocking these crawlers entirely means voluntarily removing yourself from an emerging discovery channel.

The publishers seeing the best results are those treating AI search as a new acquisition channel rather than purely a threat. This requires selective blocking rather than nuclear options. Some publishers are taking this a step further by learning how to get AI tools to cite your website as an alternative to blocking.


How to Implement Your Selective AI Blocker Filter

The technical implementation of selective blocking requires working at multiple levels. Robots.txt provides the foundation, but server-level controls add enforcement teeth.

Robots.txt: The Foundation Layer

Your robots.txt file remains the primary signal to well-behaved bots. HTTP Archive data from July 2025 shows that 94% of 12 million websites have a robots.txt file containing at least one directive. The problem? Many publishers still use all-or-nothing approaches. For a deep dive into this foundational technique, our guide on how to block AI bots with robots.txt covers everything from basic syntax to advanced configurations.

A selective configuration looks fundamentally different from a blanket block. Here's a template that blocks training crawlers while allowing search and user-triggered bots:

robots.txt

# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow user-triggered agents
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Allow traditional search
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

This configuration protects your content from training while maintaining visibility in AI search results.
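Once the file is deployed, it is worth confirming that the live version actually serves the directives you intended. A quick command-line check (example.com below is a placeholder for your own domain) shows how specific bots are treated:

bash

# Fetch the live robots.txt and show how specific bots are treated (example.com is a placeholder)
curl -s https://example.com/robots.txt | grep -iE -A1 "GPTBot|OAI-SearchBot"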

Server-Level Enforcement: Adding Teeth

Robots.txt has a fundamental limitation: compliance is voluntary. TollBit research found that AI bots bypassed 13% of website block requests in Q2 2025, roughly four times the rate recorded in Q1.

For actual enforcement, you need server-level controls. This is where user-agent blocking and rate limiting come into play.

Nginx configuration for selective blocking:

nginx

# Place inside the relevant server {} or location {} block
# Block known training crawlers at the server level
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai|Google-Extended|Bytespider)") {
    return 403;
}

# Allow search and user-triggered bots
# (No rule needed for OAI-SearchBot, ChatGPT-User, PerplexityBot, or Perplexity-User)
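If you would rather throttle aggressive crawlers than reject them outright, rate limiting offers a middle ground. The sketch below is illustrative rather than a recommendation; it uses an Nginx map so that only requests from known training bots are counted against the limit:

nginx

# http {} context: only requests from known training bots get a non-empty key,
# so ordinary visitors are never rate limited (zone name and rate are illustrative)
map $http_user_agent $ai_training_bot {
    default "";
    ~*(GPTBot|ClaudeBot|CCBot|Bytespider) $binary_remote_addr;
}
limit_req_zone $ai_training_bot zone=ai_bots:10m rate=1r/s;

# server {} or location {} context: requests with an empty key are not limited
limit_req zone=ai_bots burst=5 nodelay;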

Cloudflare and CDN-Level Controls

If you're using Cloudflare or similar CDN services, you have additional options. Cloudflare launched a one-click feature to block all AI bots, available to all customers including those on the free tier. However, this is exactly the all-or-nothing approach you want to avoid.

Instead, use Cloudflare's WAF rules to create granular controls. You can build custom rules that block specific user agents while allowing others, giving you the selectivity you need. For publishers still evaluating their overall approach, our complete publisher's guide to AI crawlers provides a comprehensive framework for deciding whether to block, allow, or optimize.
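As a rough sketch, a custom rule expression along the following lines (field names per Cloudflare's Rules language; confirm the syntax against Cloudflare's current documentation) blocks the training crawlers discussed above while leaving search and user-triggered bots untouched:

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "Bytespider")

Set the rule's action to Block, or to Managed Challenge if you prefer a softer response.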

Building Your Bot Classification Framework

Knowing which bots to allow and which to block requires a systematic approach. Not every publisher will make the same decisions, and your strategy should align with your specific business model.

Classification Criteria

Consider these factors when deciding how to handle each bot:

  • Traffic return: Does this bot drive referral traffic back to your site?
  • Attribution quality: When your content appears, do users get a clear path to visit your site?
  • Crawl frequency: How aggressively does this bot hit your server?
  • Compliance history: Does this operator respect robots.txt and other signals?
  • Business relationship: Do you have a licensing deal or partnership with this operator?

The Three-Tier Approach

Based on these criteria, you can classify bots into three categories:

Tier 1: Allow with monitoring. Bots that drive measurable traffic or have clear attribution models. This includes user-triggered crawlers and AI search indexers with good referral ratios.

Tier 2: Block by default. Training crawlers with no traffic return and high crawl volumes. These include model-training bots from major AI labs.

Tier 3: Conditional access. Bots where the value proposition is unclear or evolving. Monitor these closely and adjust based on observed behavior.

| Bot Category | Typical Members | Recommended Action |
| --- | --- | --- |
| User-triggered | ChatGPT-User, Perplexity-User | Allow |
| AI search indexers | OAI-SearchBot, PerplexityBot | Allow with monitoring |
| Model training | GPTBot, ClaudeBot, CCBot | Block |
| Bundled crawlers | Google-Extended | Evaluate individually |
| Unknown/unverified | Various | Block or challenge |
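One way to make the framework operational is to encode it in a small script you can run against any user-agent string you find in your logs. The sketch below is a minimal bash example; the bot lists simply mirror the table above and should be adjusted to your own policy:

bash

#!/usr/bin/env bash
# Classify a user-agent string into the three tiers described above (lists mirror the table; adjust to taste)
classify_bot() {
  local ua="$1"
  case "$ua" in
    *ChatGPT-User*|*Perplexity-User*)           echo "tier1-allow" ;;
    *OAI-SearchBot*|*PerplexityBot*)            echo "tier1-allow-with-monitoring" ;;
    *GPTBot*|*ClaudeBot*|*CCBot*|*Bytespider*)  echo "tier2-block" ;;
    *Google-Extended*)                          echo "tier3-evaluate" ;;
    *)                                          echo "unknown-block-or-challenge" ;;
  esac
}

# Example call (illustrative user-agent string, not an exact copy of any vendor's UA)
classify_bot "Mozilla/5.0 (compatible; GPTBot/1.0)"   # prints: tier2-block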

Monitoring and Adjusting Your Strategy

Implementing a block AI bots strategy isn't a set-it-and-forget-it exercise. The AI crawler landscape changes constantly, and new bots emerge regularly.

What to Monitor

Your server logs contain the truth about which bots are actually hitting your site. Parse them regularly to identify:

  • New user agents: Unknown bots that aren't in your classification framework
  • Behavior patterns: Unusually aggressive crawling from any source
  • Compliance testing: Whether blocked bots are actually staying out
  • Referral attribution: Traffic coming from AI search platforms

Log Analysis Commands

For those comfortable with command-line tools, this grep command helps identify AI bot activity in your access logs:

bash

# Pull requests from known AI crawlers out of the access log (combined log format assumed)
# Fields printed: $1 client IP, $4 timestamp, $7 request path, $12 start of the user-agent string
grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bytespider" access.log | awk '{print $1,$4,$7,$12}' | head -100
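The same logs can answer the referral-attribution question from the other direction. Assuming the combined log format, where the referrer is the fourth double-quoted field, something like this surfaces visits arriving from AI platforms (the domain list is illustrative and will need updating as platforms change):

bash

# Count referrals from AI platforms, most common referrer first (combined log format assumed)
awk -F'"' '{print $4}' access.log | grep -Ei 'chatgpt\.com|perplexity\.ai' | sort | uniq -c | sort -rn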


Adjusting Based on Results

The Atlantic's team reportedly meets weekly to discuss how AI bots are behaving, tracking which crawlers hit the site and which lead to referral traffic and subscription conversions. This level of attention might seem excessive, but for publishers dependent on traffic, the stakes justify the effort.

Review your bot strategy at least monthly. The AI search landscape is moving too fast for quarterly reviews to catch meaningful changes. Just as major changes to Chrome's cookie blocking timeline required publishers to adapt their data strategies, the AI crawler ecosystem demands similar ongoing attention.

Common Implementation Mistakes

Even publishers with good intentions make errors when implementing selective blocking. Avoid these pitfalls:

Blocking Too Broadly

One commenter on nixCraft warned against blocking Google-Extended at all if you want to keep receiving Google traffic, claiming it can stop Google from crawling and indexing entire sites even though Google's official documentation says otherwise. The interplay between different crawlers can be complex, and overly broad blocks can have unintended consequences.

Relying Solely on Robots.txt

As noted earlier, robots.txt compliance is voluntary. If you're serious about blocking certain bots, you need server-level enforcement as backup.

Ignoring the Spoofing Problem

Fastly notes that Anthropic doesn't publish IPs at all, making it nearly impossible to verify traffic claiming to be Claude. Some bad actors spoof legitimate user-agent strings. Verification through reverse DNS lookups or IP range checking adds another layer of confidence.
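For crawlers whose operators do document their infrastructure, the classic verification pattern is a reverse DNS lookup followed by a forward lookup to confirm the hostname resolves back to the same IP. Googlebot is the best-documented example of this pattern; the IP below comes from Google's own documentation and is used purely as an illustration:

bash

# Step 1: reverse-resolve the IP that claims to be a crawler
host 66.249.66.1
# Expect a pointer to a hostname on the operator's domain, e.g. crawl-66-249-66-1.googlebot.com

# Step 2: forward-resolve that hostname and confirm it returns the original IP
host crawl-66-249-66-1.googlebot.com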

Forgetting to Update

New bots appear constantly. Anthropic merged its earlier data scrapers, "anthropic-ai" and "claude-web," into a new bot named "ClaudeBot," and it took websites time to discover the change, during which the new crawler enjoyed access that existing blocks did not cover. Your blocking strategy needs regular updates to remain effective.


The Path Forward: Balancing Protection and Visibility

The AI crawler landscape won't simplify itself anytime soon. If anything, it's getting more complex as more operators launch their own bots and existing players evolve their strategies.

Cloudflare data shows the three bots with the highest number of Disallows are GPTBot, CCBot, and anthropic-ai. Compared to January, there is a steep decrease in "Partially Disallowed" permissions, with websites now flat-out choosing "Fully Disallowed" for top AI crawlers.

This all-or-nothing trend might feel satisfying from a content protection standpoint, but it potentially sacrifices visibility in an emerging channel. The publishers who will win are those who can be more nuanced.

Your goal isn't to eliminate all AI interaction with your content. It's to ensure that every interaction provides value commensurate with what's being taken. Training bots take everything and give nothing. Search bots can drive traffic. Your blocking strategy should reflect that distinction.


Maximizing the Traffic You Keep

Whether you block some AI crawlers, all of them, or none, the traffic that does reach your site needs to convert into revenue. This is where your monetization strategy becomes critical. Publishers navigating these decisions should also understand how programmatic advertising works to maximize the value of every visitor.

Publishers who've implemented effective ad monetization see the compounding benefit: more traffic means more impressions, optimized layouts mean higher CPMs, and the combination multiplies your revenue potential. Understanding how header bidding maximizes competition for your inventory is particularly valuable for publishers looking to extract maximum value from their traffic.

The key is working with a monetization partner who understands the nuances of publisher traffic. Different sources behave differently. Users arriving from AI search might have different engagement patterns than organic search visitors. 

Getting your AI bot strategy right protects your traffic sources. Getting your monetization right ensures that traffic translates to maximum revenue. The publishers who nail both will outperform those who focus on only one.

Ready to ensure your traffic works harder for you? Reach out to Playwire to see how advanced yield optimization can amplify your ad revenue regardless of how your visitors find you.
