
Selective AI Blocking: How to Allow Beneficial Bots While Blocking Others

December 8, 2025




Key Points

  • AI crawlers fall into two distinct categories: training bots that scrape content for model development and search bots that can drive referral traffic back to your site.
  • A blanket block of all AI bots may protect your content from unauthorized training, but it also cuts off emerging traffic sources from AI search platforms.
  • Implementing a selective AI blocker filter strategy lets you maintain visibility in AI search results while blocking pure training crawlers.
  • Technical implementation requires understanding specific user agents and combining robots.txt directives with server-level controls for maximum effectiveness.
  • Monitoring and adjusting your block AI bots strategy is essential as the landscape evolves and new crawlers emerge regularly.

The All-or-Nothing Approach Is Leaving Money on the Table

Publishers face a genuine dilemma in the AI era. Block everything and you might protect your content from being absorbed into training datasets. Allow everything and your intellectual property feeds someone else's profit machine with nothing coming back. Neither extreme serves your revenue goals.

The good news? You don't have to choose between complete lockdown and open season. A nuanced approach exists that lets you welcome the bots bringing value while showing the door to those that only take. For publishers weighing this decision, understanding the real cost of blocking AI and its impact on traffic and revenue provides essential context before implementing any strategy.


Understanding the AI Bot Ecosystem

Before implementing any AI blocker filter strategy, you need to understand what you're actually dealing with. AI bots aren't a monolithic category. They serve fundamentally different purposes, and lumping them together is like treating all vehicles the same whether they're delivery trucks or getaway cars.

Training Crawlers: The Content Vacuum

Training crawlers exist to gather massive amounts of web content for developing and refining large language models. These bots scrape text, images, and structured data to build the datasets that power AI systems.

The key characteristic of training crawlers? They take without giving back. Your content goes in, and nothing comes out the other end toward your site. No traffic, no attribution, no compensation. Cloudflare data shows training now drives nearly 80% of AI bot activity, up from 72% a year ago. Knowing exactly which bots fall into this category is critical — our complete list of AI crawlers and how to block each one breaks down the full landscape.

| Bot Name | Operator | Primary Purpose | Robots.txt Compliance |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Model training data collection | Generally respects |
| ClaudeBot | Anthropic | Training data for Claude models | Generally respects |
| CCBot | Common Crawl | Dataset building for AI research | Generally respects |
| Google-Extended | Google | Gemini model training | Generally respects |
| Bytespider | ByteDance | Training data for Doubao/TikTok AI | Inconsistent |

Search and Retrieval Bots: The Traffic Drivers

Search and retrieval bots serve a different purpose entirely. These crawlers index content to power AI search engines and real-time answer retrieval. When users ask questions through AI platforms, these bots help surface your content and can drive referral traffic back to your site.

| Bot Name | Operator | Primary Purpose | Traffic Potential |
| --- | --- | --- | --- |
| OAI-SearchBot | OpenAI | ChatGPT Search indexing | Medium to High |
| ChatGPT-User | OpenAI | Real-time user query retrieval | High |
| PerplexityBot | Perplexity AI | Answer engine indexing | Medium |
| Perplexity-User | Perplexity AI | User-triggered content fetch | High |

The distinction matters enormously for publishers. Blocking GPTBot might make sense for protecting your training data rights. Blocking OAI-SearchBot cuts you off from appearing in ChatGPT search results entirely.


The Revenue Case for Selective Blocking

Publishers who rely on ad revenue have a particularly compelling reason to think strategically about which bots to block. Your business model depends on traffic volume. More visitors mean more ad impressions, which translates directly to revenue. Understanding what you need to know about ad blocking rate helps contextualize why protecting traffic sources matters so much for your bottom line.

The Crawl-to-Referral Imbalance

Not all AI platforms treat publishers equally. According to Cloudflare data, the crawl-to-referral ratios vary dramatically across platforms. In July 2025, Anthropic crawled 38,000 pages for every visitor referred back to publishers, while OpenAI maintained a ratio of 1,091 crawls per referral. Perplexity's ratio sits at 194 crawls per visitor.

These numbers reveal which platforms are taking the most while giving back the least. This imbalance should inform your blocking strategy.

Traffic Sources You Cannot Afford to Ignore

AI search is becoming a meaningful traffic channel. Similarweb data shows ChatGPT sent 243.8 million visits to 250 news and media websites in April 2025, up 98% from January. Blocking these crawlers entirely means voluntarily removing yourself from an emerging discovery channel.

The publishers seeing the best results are those treating AI search as a new acquisition channel rather than purely a threat. This requires selective blocking rather than nuclear options. Some publishers are taking this a step further by learning how to get AI tools to cite your website as an alternative to blocking.


How to Implement Your Selective AI Blocker Filter

The technical implementation of selective blocking requires working at multiple levels. Robots.txt provides the foundation, but server-level controls add enforcement teeth.

Robots.txt: The Foundation Layer

Your robots.txt file remains the primary signal to well-behaved bots. HTTP Archive data from July 2025 shows that 94% of 12 million websites have a robots.txt file containing at least one directive. The problem? Many publishers still use all-or-nothing approaches. For a deep dive into this foundational technique, our guide on how to block AI bots with robots.txt covers everything from basic syntax to advanced configurations.

A selective configuration looks fundamentally different from a blanket block. Here's a template that blocks training crawlers while allowing search and user-triggered bots:

robots.txt

# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow user-triggered agents
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Allow traditional search
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

This configuration protects your content from training while maintaining visibility in AI search results.
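Once the file is deployed, it is worth confirming that the live version actually serves the directives you intended. A quick command-line check (example.com below is a placeholder for your own domain) shows how specific bots are treated:

bash

# Fetch the live robots.txt and show how specific bots are treated (example.com is a placeholder)
curl -s https://example.com/robots.txt | grep -iE -A1 "GPTBot|OAI-SearchBot"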

Server-Level Enforcement: Adding Teeth

Robots.txt has a fundamental limitation: compliance is voluntary. TollBit research found that AI bots bypassed 13% of website block requests in Q2 2025, roughly four times the rate recorded in Q1.

For actual enforcement, you need server-level controls. This is where user-agent blocking and rate limiting come into play.

Nginx configuration for selective blocking:

nginx

# Place inside the relevant server {} or location {} block
# Block known training crawlers at the server level
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|anthropic-ai|Google-Extended|Bytespider)") {
    return 403;
}

# Allow search and user-triggered bots
# (No rule needed for OAI-SearchBot, ChatGPT-User, PerplexityBot, or Perplexity-User)
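If you would rather throttle aggressive crawlers than reject them outright, rate limiting offers a middle ground. The sketch below is illustrative rather than a recommendation; it uses an Nginx map so that only requests from known training bots are counted against the limit:

nginx

# http {} context: only requests from known training bots get a non-empty key,
# so ordinary visitors are never rate limited (zone name and rate are illustrative)
map $http_user_agent $ai_training_bot {
    default "";
    ~*(GPTBot|ClaudeBot|CCBot|Bytespider) $binary_remote_addr;
}
limit_req_zone $ai_training_bot zone=ai_bots:10m rate=1r/s;

# server {} or location {} context: requests with an empty key are not limited
limit_req zone=ai_bots burst=5 nodelay;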

Cloudflare and CDN-Level Controls

If you're using Cloudflare or similar CDN services, you have additional options. Cloudflare launched a one-click feature to block all AI bots, available to all customers including those on the free tier. However, this is exactly the all-or-nothing approach you want to avoid.

Instead, use Cloudflare's WAF rules to create granular controls. You can build custom rules that block specific user agents while allowing others, giving you the selectivity you need. For publishers still evaluating their overall approach, our complete publisher's guide to AI crawlers provides a comprehensive framework for deciding whether to block, allow, or optimize.
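As a rough sketch, a custom rule expression along the following lines (field names per Cloudflare's Rules language; confirm the syntax against Cloudflare's current documentation) blocks the training crawlers discussed above while leaving search and user-triggered bots untouched:

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "Bytespider")

Set the rule's action to Block, or to Managed Challenge if you prefer a softer response.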

Building Your Bot Classification Framework

Knowing which bots to allow and which to block requires a systematic approach. Not every publisher will make the same decisions, and your strategy should align with your specific business model.

Classification Criteria

Consider these factors when deciding how to handle each bot:

  • Traffic return: Does this bot drive referral traffic back to your site?
  • Attribution quality: When your content appears, do users get a clear path to visit your site?
  • Crawl frequency: How aggressively does this bot hit your server?
  • Compliance history: Does this operator respect robots.txt and other signals?
  • Business relationship: Do you have a licensing deal or partnership with this operator?

The Three-Tier Approach

Based on these criteria, you can classify bots into three categories:

Tier 1: Allow with monitoring. Bots that drive measurable traffic or have clear attribution models. This includes user-triggered crawlers and AI search indexers with good referral ratios.

Tier 2: Block by default. Training crawlers with no traffic return and high crawl volumes. These include model-training bots from major AI labs.

Tier 3: Conditional access. Bots where the value proposition is unclear or evolving. Monitor these closely and adjust based on observed behavior.

| Bot Category | Typical Members | Recommended Action |
| --- | --- | --- |
| User-triggered | ChatGPT-User, Perplexity-User | Allow |
| AI search indexers | OAI-SearchBot, PerplexityBot | Allow with monitoring |
| Model training | GPTBot, ClaudeBot, CCBot | Block |
| Bundled crawlers | Google-Extended | Evaluate individually |
| Unknown/unverified | Various | Block or challenge |
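One way to make the framework operational is to encode it in a small script you can run against any user-agent string you find in your logs. The sketch below is a minimal bash example; the bot lists simply mirror the table above and should be adjusted to your own policy:

bash

#!/usr/bin/env bash
# Classify a user-agent string into the three tiers described above (lists mirror the table; adjust to taste)
classify_bot() {
  local ua="$1"
  case "$ua" in
    *ChatGPT-User*|*Perplexity-User*)           echo "tier1-allow" ;;
    *OAI-SearchBot*|*PerplexityBot*)            echo "tier1-allow-with-monitoring" ;;
    *GPTBot*|*ClaudeBot*|*CCBot*|*Bytespider*)  echo "tier2-block" ;;
    *Google-Extended*)                          echo "tier3-evaluate" ;;
    *)                                          echo "unknown-block-or-challenge" ;;
  esac
}

# Example call (illustrative user-agent string, not an exact copy of any vendor's UA)
classify_bot "Mozilla/5.0 (compatible; GPTBot/1.0)"   # prints: tier2-block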

Monitoring and Adjusting Your Strategy

Implementing a block AI bots strategy isn't a set-it-and-forget-it exercise. The AI crawler landscape changes constantly, and new bots emerge regularly.

What to Monitor

Your server logs contain the truth about which bots are actually hitting your site. Parse them regularly to identify:

  • New user agents: Unknown bots that aren't in your classification framework
  • Behavior patterns: Unusually aggressive crawling from any source
  • Compliance testing: Whether blocked bots are actually staying out
  • Referral attribution: Traffic coming from AI search platforms

Log Analysis Commands

For those comfortable with command-line tools, this grep command helps identify AI bot activity in your access logs:

bash

# Pull requests from known AI crawlers out of the access log (combined log format assumed)
# Fields printed: $1 client IP, $4 timestamp, $7 request path, $12 start of the user-agent string
grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bytespider" access.log | awk '{print $1,$4,$7,$12}' | head -100
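The same logs can answer the referral-attribution question from the other direction. Assuming the combined log format, where the referrer is the fourth double-quoted field, something like this surfaces visits arriving from AI platforms (the domain list is illustrative and will need updating as platforms change):

bash

# Count referrals from AI platforms, most common referrer first (combined log format assumed)
awk -F'"' '{print $4}' access.log | grep -Ei 'chatgpt\.com|perplexity\.ai' | sort | uniq -c | sort -rn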


Adjusting Based on Results

The Atlantic's team reportedly meets weekly to discuss how AI bots are behaving, tracking which crawlers hit the site and which lead to referral traffic and subscription conversions. This level of attention might seem excessive, but for publishers dependent on traffic, the stakes justify the effort.

Review your bot strategy at least monthly. The AI search landscape is moving too fast for quarterly reviews to catch meaningful changes. Just as major changes to Chrome's cookie blocking timeline required publishers to adapt their data strategies, the AI crawler ecosystem demands similar ongoing attention.

Common Implementation Mistakes

Even publishers with good intentions make errors when implementing selective blocking. Avoid these pitfalls:

Blocking Too Broadly

One commenter on nixCraft warned against blocking Google-Extended at all if you want to keep receiving Google traffic, claiming it can stop Google from crawling and indexing entire sites even though Google's official documentation says otherwise. The interplay between different crawlers can be complex, and overly broad blocks can have unintended consequences.

Relying Solely on Robots.txt

As noted earlier, robots.txt compliance is voluntary. If you're serious about blocking certain bots, you need server-level enforcement as backup.

Ignoring the Spoofing Problem

Fastly notes that Anthropic doesn't publish IPs at all, making it nearly impossible to verify traffic claiming to be Claude. Some bad actors spoof legitimate user-agent strings. Verification through reverse DNS lookups or IP range checking adds another layer of confidence.
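For crawlers whose operators do document their infrastructure, the classic verification pattern is a reverse DNS lookup followed by a forward lookup to confirm the hostname resolves back to the same IP. Googlebot is the best-documented example of this pattern; the IP below comes from Google's own documentation and is used purely as an illustration:

bash

# Step 1: reverse-resolve the IP that claims to be a crawler
host 66.249.66.1
# Expect a pointer to a hostname on the operator's domain, e.g. crawl-66-249-66-1.googlebot.com

# Step 2: forward-resolve that hostname and confirm it returns the original IP
host crawl-66-249-66-1.googlebot.com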

Forgetting to Update

New bots appear constantly. Anthropic merged its earlier data scrapers, "anthropic-ai" and "claude-web," into a new bot named "ClaudeBot," and it took websites time to discover the change, during which the new crawler enjoyed access that existing blocks did not cover. Your blocking strategy needs regular updates to remain effective.


The Path Forward: Balancing Protection and Visibility

The AI crawler landscape won't simplify itself anytime soon. If anything, it's getting more complex as more operators launch their own bots and existing players evolve their strategies.

Cloudflare data shows the three bots with the highest number of Disallows are GPTBot, CCBot, and anthropic-ai. Compared to January, there is a steep decrease in "Partially Disallowed" permissions, with websites now flat-out choosing "Fully Disallowed" for top AI crawlers.

This all-or-nothing trend might feel satisfying from a content protection standpoint, but it potentially sacrifices visibility in an emerging channel. The publishers who will win are those who can be more nuanced.

Your goal isn't to eliminate all AI interaction with your content. It's to ensure that every interaction provides value commensurate with what's being taken. Training bots take everything and give nothing. Search bots can drive traffic. Your blocking strategy should reflect that distinction.


Maximizing the Traffic You Keep

Whether you block some AI crawlers, all of them, or none, the traffic that does reach your site needs to convert into revenue. This is where your monetization strategy becomes critical. Publishers navigating these decisions should also understand how programmatic advertising works to maximize the value of every visitor.

Publishers who've implemented effective ad monetization see the compounding benefit: more traffic means more impressions, optimized layouts mean higher CPMs, and the combination multiplies your revenue potential. Understanding how header bidding maximizes competition for your inventory is particularly valuable for publishers looking to extract maximum value from their traffic.

The key is working with a monetization partner who understands the nuances of publisher traffic. Different sources behave differently. Users arriving from AI search might have different engagement patterns than organic search visitors. 

Getting your AI bot strategy right protects your traffic sources. Getting your monetization right ensures that traffic translates to maximum revenue. The publishers who nail both will outperform those who focus on only one.

Ready to ensure your traffic works harder for you? Reach out to Playwire to see how advanced yield optimization can amplify your ad revenue regardless of how your visitors find you.
