
How to Block AI From Scraping Your Website: A Technical Implementation Guide

December 8, 2025



Key Points

  • Robots.txt provides the first line of defense: Compliant AI crawlers like GPTBot, ClaudeBot, and PerplexityBot respect robots.txt directives, making this file your simplest mechanism to block AI scrapers from accessing your content.
  • Server-side blocking offers enforcement teeth: Unlike robots.txt, which is voluntary, Apache .htaccess and Nginx configurations actively deny requests from AI scrapers attempting to harvest your content.
  • Rate limiting protects server resources: Even when allowing some AI crawler access, rate limiting prevents aggressive scraping from overwhelming your infrastructure and degrading user experience.
  • Verification methods confirm your blocks work: Testing with curl commands and online checker tools ensures your implementation actually stops the bots you're targeting.
  • Protecting your traffic protects your revenue: For publishers relying on ad monetization, every visitor scraped away represents potential ad impressions lost and revenue that never materializes.

The Stakes for Publishers: Why You Need to Block AI Scrapers

AI crawlers have fundamentally changed the web's ecosystem. Unlike traditional search engine bots that index your content and send visitors back to your site, AI scrapers download your content to train models that may never reference you again.

The numbers paint a stark picture. According to industry analysis from the IAB Tech Lab, AI-powered search summaries reduce publisher traffic by 20% to 60% on average, with niche publications experiencing losses approaching 90%. These reductions translate to approximately $2 billion in annual advertising revenue losses across the publishing sector.


According to recent research on AI bots and robots.txt, as of July 2025, AI bots top the list of user agents referenced in robots.txt files across popular sites. Almost 21% of the top 1,000 websites now have rules for OpenAI's GPTBot in their robots.txt file.

The stakes here are straightforward for publishers. Your traffic drives your ad revenue. When AI models consume your content without sending visitors, you're essentially subsidizing their training with your server resources while receiving nothing in return. Understanding how to manage and monitor your website ad revenue metrics becomes critical when external forces threaten your traffic foundation.

This guide provides the technical implementation details you need to take control.


Understanding AI Crawler Types Before You Block AI Bots

Before blocking anything, you need to understand what you're dealing with. AI crawlers fall into three main categories, and knowing the difference helps you make smarter blocking decisions about which bots to stop. For a deeper dive into the strategic considerations, our complete publisher's guide to AI crawlers covers blocking, allowing, or optimizing for maximum revenue.

AI Data Scrapers

These bots harvest content to train large language models. According to Originality.ai's documentation, GPTBot is developed by OpenAI to crawl web sources and download training data for the company's Large Language Models and products like ChatGPT. Other major training scrapers include ClaudeBot (Anthropic), CCBot (Common Crawl), and Bytespider (ByteDance).

AI Search Crawlers

These bots index content for AI-powered search engines. PerplexityBot and OAI-SearchBot fall into this category. Blocking these might impact your visibility in AI search results, so consider whether that trade-off makes sense for your business.

AI Assistants

Bots like ChatGPT-User and Meta-ExternalFetcher retrieve content in real-time to answer user queries. As Neil Clarke notes, the Meta-ExternalFetcher crawler performs user-initiated fetches of individual links in support of some AI tools. Because the fetch was initiated by a user, this crawler may bypass robots.txt rules entirely.

| Crawler Type | Primary Function | robots.txt Compliance | Blocking Impact |
|---|---|---|---|
| Training Scrapers | LLM model training | Generally yes | Prevents content use in training |
| Search Crawlers | AI search indexing | Yes | Reduces AI search visibility |
| AI Assistants | Real-time query responses | Varies | May reduce citation in responses |


Method 1: Block AI Scrapers with Robots.txt Implementation

The robots.txt file remains your first line of defense when you want to block AI from scraping your website. As explained in this LLM crawler blocking guide, this small text file tells crawlers which parts of your site they are allowed to access. Most legitimate AI crawlers, like GPTBot, ClaudeBot, PerplexityBot, and CCBot, officially state that they respect robots.txt.

For comprehensive instructions on this foundational method, see our complete publisher's guide to blocking AI bots with robots.txt.

Basic Robots.txt Syntax

The robots.txt file uses a simple syntax. Each rule set begins with a User-agent declaration followed by Disallow directives specifying paths to block.

Place your robots.txt file in your website's root directory. For example, if your site is example.com, the file should be accessible at example.com/robots.txt.
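
Before the full template, here is the basic shape of a rule set. The crawler name below is a hypothetical placeholder used only to illustrate the structure:

robots.txt

# Each rule set starts with the crawler's user agent name...
User-agent: ExampleBot
# ...followed by the paths that crawler may not access ("/" means the entire site)
Disallow: /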

Complete AI Blocker Template

Here's a comprehensive robots.txt template covering major AI crawlers:

robots.txt

# Block AI training scrapers, search crawlers, and assistants

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

Selective Blocking Strategies

You don't have to block everything. Some publishers allow AI search crawlers while blocking training scrapers. This approach maintains visibility in AI-powered search while protecting content from model training.

To allow specific bots access to certain directories while blocking them from others:

robots.txt

User-agent: GPTBot
Disallow: /premium-content/
Disallow: /members-only/
Allow: /blog/
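
Another common split, following the strategy above, is to block training scrapers site-wide while leaving an AI search crawler such as OAI-SearchBot unrestricted. A sketch; adjust the bot lists to your own policy:

robots.txt

# Block training scrapers entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Explicitly allow an AI search crawler
User-agent: OAI-SearchBot
Allow: /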

Critical Limitations of Robots.txt

Here's the uncomfortable truth about using robots.txt to block AI scrapers. Cloudflare's documentation states clearly that respecting robots.txt is voluntary. Some crawler operators may disregard your robots.txt preferences and crawl your content regardless.

This is why robots.txt should be your first layer, not your only layer. The legal landscape publishers need to know about blocking AI scrapers in 2025 clarifies what rights you have when crawlers ignore your directives.

Method 2: Server-Side Blocking with Apache (.htaccess)

Server-side blocking provides enforcement that robots.txt cannot offer when you need to block AI bots more aggressively. As WEBLYNX explains, unlike robots.txt, bots and scrapers cannot bypass any rules you have configured on your web server.

Apache .htaccess Configuration

For Apache and LiteSpeed servers, add this configuration to your .htaccess file in your website's root directory:

bash

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /

# Block AI crawlers and scrapers
RewriteCond %{HTTP_USER_AGENT} ^.*(GPTBot|ChatGPT-User|ClaudeBot|Claude-Web|anthropic-ai|CCBot|Google-Extended|PerplexityBot|Bytespider|Diffbot|FacebookBot|Meta-ExternalAgent|Amazonbot|cohere-ai|YouBot|Omgilibot|ImagesiftBot|Applebot-Extended).*$ [NC]
RewriteRule .* - [F,L]
</IfModule>

This configuration returns a 403 Forbidden response to any request matching these user agents. The [NC] flag makes the match case-insensitive, and [F,L] triggers the forbidden response and stops processing additional rules.

Understanding the Syntax

The RewriteCond directive checks the HTTP_USER_AGENT header against a regular expression pattern. The leading ^.* and trailing .*$ wildcards let the bot identifier appear anywhere within the user agent string.

This matters because a common mistake is blocking the exact string only, when the actual user-agent is something like "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)". AI crawlers embed their identifiers within longer user agent strings, so your pattern needs to account for surrounding text.
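
If you want a quick sanity check before deploying, you can test the same pattern against a full user agent string from the shell. A minimal sketch; the sample string is the GPTBot user agent format used again in the verification section below:

bash

# Confirm the pattern matches a bot identifier embedded in a full UA string
UA="Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
echo "$UA" | grep -Eiq "(GPTBot|ChatGPT-User|ClaudeBot|CCBot)" && echo "pattern matches"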

Method 3: Block AI Scrapers with Nginx Server Configuration

Nginx servers require configuration in the server block rather than an .htaccess file when you need to block AI from scraping your website.

Nginx Configuration

Add this to your Nginx configuration file (typically located at /etc/nginx/sites-enabled/yoursite.conf or within your server block):

bash

# Block AI crawlers
if ($http_user_agent ~* "(GPTBot|ChatGPT-User|ClaudeBot|Claude-Web|anthropic-ai|CCBot|Google-Extended|PerplexityBot|Bytespider|Diffbot|FacebookBot|Meta-ExternalAgent|Amazonbot|cohere-ai|YouBot|Omgilibot|ImagesiftBot|Applebot-Extended)") {
    return 403;
}

The ~* operator performs a case-insensitive regex match. After adding this configuration, reload Nginx with sudo systemctl reload nginx or sudo nginx -s reload.
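
Before reloading, it's worth validating the configuration so a typo doesn't take the site down:

bash

# Test the configuration, then reload only if the test passes
sudo nginx -t && sudo systemctl reload nginx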

Alternative: Nginx Map Method

For better performance with large bot lists, use the map directive in your http block:

bash

map $http_user_agent $blocked_agent {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*CCBot 1;
    ~*Bytespider 1;
    # Add additional bots as needed
}
server {
    # In your server block
    if ($blocked_agent) {
        return 403;
    }
}

As noted in the nginx-ultimate-bad-bot-blocker project, Nginx's non-standard 444 response simply drops the connection: a matching request gets no reply at all, so from the bot's perspective your server does not exist. Consider using return 444; instead of return 403; for this stealth blocking approach.
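
With the map method above, the stealth variant is a one-line change in the server block:

bash

# Drop the connection with no response instead of returning 403 Forbidden
if ($blocked_agent) {
    return 444;
}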


Method 4: Rate Limiting to Control AI Bot Access

Rate limiting provides a middle-ground approach for publishers who want to manage rather than completely block AI scrapers. Rather than blocking AI crawlers entirely, you limit how quickly they can access your content. This protects server resources while potentially maintaining some AI search visibility.

Cloudflare Rate Limiting

According to Cloudflare's WAF documentation, rate limiting rules allow you to define rate limits for requests matching an expression and the action to perform when those rate limits are reached.

Cloudflare's WAF offers granular rate limiting controls. Configure rules that target specific user agents:

| Parameter | Recommended Setting | Purpose |
|---|---|---|
| Requests per period | 10-50 | Maximum requests before triggering |
| Period | 60 seconds | Time window for counting requests |
| Duration | 300-600 seconds | How long to block after limit exceeded |
| Action | Block or Challenge | Response when limit hit |

Server-Level Rate Limiting

For Nginx, use the limit_req module to control AI bot access. Because limit_req cannot be placed inside an if block, pair it with a map that sets a rate-limit key only for AI user agents; requests with an empty key are not rate limited:

bash

http {
    # Set a rate-limit key only for AI crawler user agents; requests with an
    # empty key are not counted against the limit.
    map $http_user_agent $ai_limit_key {
        default "";
        "~*(GPTBot|ClaudeBot|CCBot)" $binary_remote_addr;
    }
    limit_req_zone $ai_limit_key zone=ai_limit:10m rate=5r/s;
    server {
        location / {
            limit_req zone=ai_limit burst=10 nodelay;
        }
    }
}

Apache users can throttle response bandwidth with mod_ratelimit or block high-frequency clients with mod_evasive, though both require additional module installation.
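
As a rough illustration, a mod_evasive configuration might look like the following. This is a sketch only: the <IfModule> name varies by package (mod_evasive20.c or mod_evasive24.c), and the thresholds are illustrative starting points rather than recommendations:

bash

<IfModule mod_evasive20.c>
    # Block an IP that requests the same page more than 10 times in 1 second
    DOSPageCount      10
    DOSPageInterval   1
    # ...or that makes more than 50 requests to the whole site in 1 second
    DOSSiteCount      50
    DOSSiteInterval   1
    # Keep the offending IP blocked for 300 seconds
    DOSBlockingPeriod 300
</IfModule>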

Method 5: Cloudflare AI Crawl Control

For publishers using Cloudflare, the platform offers managed AI crawler blocking that handles the complexity for you when you want to block AI bots without manual configuration.

According to Cloudflare's learning center, Cloudflare AI Crawl Control helps web content owners regain control over AI crawlers. Cloudflare protects around 20% of all web properties, giving it deep insight into all kinds of crawler activity.

Managed robots.txt

Cloudflare's managed robots.txt documentation explains that Cloudflare will independently check whether your website has an existing robots.txt file and update the behavior of this feature. If your website already has a robots.txt file, Cloudflare will prepend their managed robots.txt before your existing one, combining both into a single response.

Navigate to your Cloudflare dashboard, select your domain, and find AI Crawl Control under the Security or Bots section. Enable the managed robots.txt feature to automatically block known AI crawlers.

WAF Rules for AI Crawlers

Create custom WAF rules to block or challenge AI crawlers:

  • Rule name: Block AI Training Crawlers
  • Field: User Agent
  • Operator: Contains
  • Value: GPTBot OR ClaudeBot OR CCBot OR Bytespider
  • Action: Block
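
In the dashboard's expression editor, that rule translates into Cloudflare's rules language roughly as follows; a sketch you can extend with any other user agents you want to block:

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "Bytespider")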

This approach provides several advantages over manual configuration. Cloudflare maintains updated lists of AI crawler user agents. The blocking happens at the edge before requests reach your origin server. You get detailed analytics on blocked requests.

Verification: Testing Your AI Scraper Blocking Implementation

Implementing blocks without verification is like installing a lock without checking if it works. You need to confirm your rules are actually stopping the bots you're targeting when you block AI scrapers.

Testing with curl

For local testing, use curl to spoof AI crawler user agents:

bash

# Test GPTBot blocking
curl -I -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://yoursite.com/
# Test ClaudeBot blocking
curl -I -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" https://yoursite.com/

A successful block returns a 403 Forbidden status. If you see 200 OK, your blocking isn't working. (If you opted for Nginx's return 444 instead, curl reports an empty reply rather than a status code.)


Log Analysis

Your server logs reveal which bots are actually hitting your site. Look for user agent strings containing AI crawler identifiers. Compare timestamps before and after implementing blocks to confirm reduction in AI crawler traffic. Setting up proper GA4 analytics to unlock powerful website insights complements your server log analysis with user behavior data.

For Apache, check access logs at /var/log/apache2/access.log. For Nginx, check /var/log/nginx/access.log. Search for patterns like:

bash

grep -i "gptbot\|claudebot\|ccbot" /var/log/nginx/access.log

 

Handling Crawlers That Ignore Your AI Blocking Rules

Not all crawlers play nice. As Search Engine Journal reports, fake crawlers can spoof legitimate user agents to bypass restrictions and scrape content aggressively. Anyone can impersonate ClaudeBot from their laptop and initiate a crawl request from the terminal.

IP Verification

Search Engine Journal's crawler guide notes that the most reliable verification method is checking the request IP against officially declared IP ranges. If the IP matches, allow the request; otherwise, block it.

Major AI companies publish their crawler IP ranges. OpenAI, Anthropic, and Google all provide documentation on verifying legitimate crawler requests. Implement IP allowlisting alongside user agent blocking for defense in depth.
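
A quick spot check from the command line, assuming you have saved the vendor's published CIDR ranges to a local file (gptbot-ranges.txt is a hypothetical filename here) and have the grepcidr utility installed:

bash

# Is this access-log IP inside the published GPTBot ranges?
SUSPECT_IP="203.0.113.42"
if echo "$SUSPECT_IP" | grepcidr -f gptbot-ranges.txt > /dev/null; then
    echo "IP is within the published ranges"
else
    echo "IP is not in the published ranges -- likely a spoofed user agent"
fi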

Behavior-Based Detection

Aggressive crawlers often exhibit telltale patterns:

  • High request frequency: Legitimate crawlers usually respect implicit rate limits
  • Sequential page access: Bots often crawl pages in URL order rather than following natural link patterns
  • Lack of JavaScript execution: Most AI scrapers don't render JavaScript
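
A simple frequency count surfaces the first pattern quickly. A sketch assuming the client IP is the first field of your access log, as in the default Apache and Nginx formats:

bash

# Top 20 client IPs by request volume in the current access log
awk '{ print $1 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20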

WAF solutions like Cloudflare Bot Management can identify these patterns automatically.

The Traffic Protection Payoff for Publishers

Every implementation decision here traces back to a simple reality: your traffic is your revenue. AI crawlers that scrape your content without sending visitors represent a direct threat to your monetization potential.

According to Cloudflare's analysis, as of June 2025, Google crawls websites about 14 times for every referral. But for AI companies, the crawl-to-referral ratio is orders of magnitude greater: OpenAI's ratio was 1,700:1, and Anthropic's was 73,000:1.

These ratios explain why publishers are increasingly blocking AI crawlers. The traffic exchange that made traditional search engine crawling acceptable simply doesn't exist with AI training scrapers. If you're still weighing whether blocking makes sense for your specific situation, our decision framework helps publishers evaluate whether to block AI crawlers.

Google's AI Overviews present a related but distinct challenge. If you're concerned about your content appearing in AI-generated search summaries, learn how to block Google AI Overview from using your content.

Maximizing Revenue From Your Protected Traffic

Blocking AI scrapers protects your traffic, but protecting traffic is only half the equation. The visitors you keep need to generate maximum revenue. Our complete guide to taking control of your ad revenue through automated monetization shows you how to extract maximum value from every session.


This is where working with a dedicated ad monetization partner makes the difference. Playwire's RAMP Platform combines machine learning technology with expert yield management to squeeze more value from every pageview you protect.

Publishers in the Playwire network benefit from:

  • AI-powered yield optimization: Algorithms that manage price floors across millions of rules per site
  • Premium demand access: Direct sales relationships that drive CPMs far above programmatic rates
  • Expert yield operations: A team that monitors performance 24/7, catching revenue dips before they become problems
  • Advanced analytics: Real-time data showing exactly how your protected traffic converts to revenue

For a comprehensive understanding of how to turn your protected traffic into sustainable revenue, explore the ultimate guide to monetizing your website with ads.

You've done the work to block AI scrapers and protect your content. Now make sure that protected traffic delivers the ad revenue it deserves. Curious about what the implementation process looks like? Our overview of working with Playwire covers technical implementation and timeline expectations.

Ready to amplify your ad revenue? Reach out to the Playwire team to see what your protected traffic could really earn.


Frequently Asked Questions About Blocking AI Scrapers

What is the most effective way to block AI from scraping my website?

The most effective approach combines multiple methods: robots.txt for compliant crawlers, server-side blocking via .htaccess or Nginx for enforcement, and Cloudflare or similar WAF solutions for comprehensive protection. No single method blocks all AI scrapers, so layered defense provides the strongest protection.

Will blocking AI scrapers affect my Google search rankings?

Blocking AI training crawlers like GPTBot and Google-Extended does not affect your Google search rankings. According to Raptive, Google's documentation and direct confirmation from the Google Search team indicate that the Google-Extended user agent doesn't affect your site's search rankings or inclusion in AI Overviews.

How do I know if my AI scraper blocking is working?

Use curl commands to spoof AI crawler user agents and check for 403 Forbidden responses. Online tools like CrawlerCheck.com and Dark Visitors can verify your robots.txt configuration. Analyze your server logs to confirm reduced AI crawler traffic after implementation.

Can AI scrapers bypass robots.txt blocking?

Yes. Robots.txt is voluntary, and some crawlers ignore it entirely. For enforcement, implement server-side blocking through .htaccess (Apache) or server block configurations (Nginx). These methods actively deny requests rather than requesting compliance.

Should I block all AI crawlers or just training scrapers?

This depends on your business goals. Training scrapers (like GPTBot for model development) provide no traffic benefit, so blocking them is straightforward. AI search crawlers may provide some visibility in AI-powered search results. Consider blocking training scrapers while evaluating search crawler access based on your referral traffic data.