AI Scraping vs. Traditional SEO Crawling: What Publishers Need to Know About Blocking AI
December 8, 2025
Key Points
- Search engine crawlers and AI training crawlers serve fundamentally different purposes: Traditional bots like Googlebot index your content to drive traffic back to your site, while AI crawlers like GPTBot extract your content to train language models.
- Blocking AI crawlers does not harm your SEO: Data from major publisher networks shows no statistically significant traffic changes when publishers block AI training bots while keeping search crawlers enabled.
- Selectively blocking AI bots is a strategic advantage: You can maintain full search visibility while protecting your content from being used to train AI systems that may never send visitors your way.
- The robots.txt file remains your primary tool: Properly configured directives let you allow Googlebot while blocking GPTBot, ClaudeBot, and other AI training crawlers.
- Traffic protection directly impacts ad revenue: Every pageview protected from AI-driven zero-click searches represents preserved ad impressions and monetization opportunities.
The Great Bot Divide: Why Blocking AI Matters for Publishers
Here's a truth that keeps publishers up at night: the bots crawling your site today aren't playing by the same rules. Search engine crawlers and AI training crawlers look similar in your server logs, but their intentions couldn't be more different. Understanding this distinction is the difference between protecting your traffic and accidentally tanking your SEO.
The internet has operated on a simple exchange for decades. Search engines index your content and direct users back to your website, generating traffic and ad revenue. AI crawlers have fundamentally broken this social contract.
They extract your content to make their systems smarter, often without sending a single visitor your way. For publishers navigating this new reality, understanding how AI traffic is reshaping SEO and how to optimize for AI referrals has become essential.
Recent data paints a stark picture, and training is the clear leader: over the past 12 months, 80% of AI crawling was for training, compared with 18% for search and just 2% for user actions. Meanwhile, the crawl-to-referral ratio for some AI companies reaches staggering levels, with Anthropic's Claude showing ratios from 50,000:1 to 70,900:1.
Need a Primer? Read this first:
- AI Traffic is the New SEO: Understand how AI is reshaping SEO and what optimization for AI referrals looks like
- Is AI Killing the Open Internet: Get the big-picture context on how AI is impacting publisher traffic and revenue
- The Shift to AI Search: Learn how AI-powered search is fundamentally changing user behavior and click-through rates
How Search Engine Crawlers Actually Work
Traditional search crawlers like Googlebot have been the backbone of web discovery since the internet's early days. These bots systematically navigate websites by following links, indexing content, and storing information in massive databases that power search results.
The fundamental purpose of a search crawler is retrieval. Googlebot wants to understand your content so it can surface your pages when users search for relevant topics. This creates a mutually beneficial relationship where quality content earns visibility, which drives traffic, which generates revenue. Understanding what publishers and advertisers need to know about programmatic advertising helps contextualize why this traffic matters so much for monetization.
Search crawlers typically follow these behaviors:
- Respectful crawling patterns: They honor robots.txt directives and crawl-delay requests.
- Indexing for retrieval: Content is stored to be matched against search queries.
- Traffic generation: The end goal is directing users to your website.
- Transparent identification: Legitimate search bots identify themselves clearly in user-agent strings.
AI Training Crawlers: A Different Beast Entirely
AI training crawlers represent a paradigm shift in how content is consumed on the web. Bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and Meta-ExternalAgent don't index your content for search results. They extract information to train large language models that power AI assistants and chatbots.
The economics here work entirely in the AI company's favor. Your carefully crafted content becomes training data that helps AI systems generate answers directly to users, often eliminating any reason for those users to visit your site.
It's your expertise, powering their product, without compensation or traffic in return. Publishers weighing their options should understand the real cost of blocking AI crawlers on traffic and revenue before making decisions.
The AI crawler landscape has exploded in recent months. GPTBot's share grew from 2.2% to 7.7% of all crawler traffic, with a 305% rise in requests between May 2024 and May 2025. Meanwhile, some AI crawlers have earned reputations for aggressive behavior, with certain bots making enormous volumes of requests that strain server infrastructure.
| Crawler Type | Primary Purpose | Traffic Benefit | Content Usage |
|---|---|---|---|
| Googlebot | Search indexing | High (drives clicks) | Surfaces pages in SERPs |
| Bingbot | Search indexing | Moderate | Surfaces pages in Bing |
| GPTBot | LLM training | Minimal to none | Trains ChatGPT models |
| ClaudeBot | LLM training | Minimal to none | Trains Claude models |
| Meta-ExternalAgent | AI model training | Minimal to none | Powers Meta AI products |
| PerplexityBot | AI search/retrieval | Low | Generates AI answers |
How Does Blocking AI Crawlers Work with robots.txt?
Your robots.txt file is the primary tool for controlling crawler access to your site. This simple text file in your website's root directory tells bots which areas they can and cannot access. The catch? Compliance is voluntary.
Well-behaved crawlers from major companies generally honor robots.txt directives. Google, Microsoft, OpenAI, and Anthropic have all publicly stated that their crawlers respect these rules. However, some less scrupulous bots ignore these guidelines entirely, scraping whatever they want regardless of your preferences. For a deeper dive into enforcement options, our complete publisher's guide to AI crawlers covers how to block, allow, or optimize for maximum revenue.
Setting up selective blocking is straightforward. You can allow search engine crawlers while blocking AI training bots using targeted directives:
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
The key insight here is understanding which bots serve which purposes. Google-Extended, for example, is a separate product token that controls whether your content is used for Google's AI training; it is distinct from Googlebot, so blocking it won't affect your Google Search visibility.
Does Blocking AI Crawlers Hurt Your SEO?
This is the million-dollar question, and the data provides a clear answer: no.
Research tracking thousands of publisher sites over a year-long period found no statistically significant traffic changes when sites blocked AI crawlers. The average traffic variation between sites that blocked AI bots and those that didn't remained within 1%, which falls within normal fluctuation ranges.
Major publishers have already made this call. The New York Times, The Wall Street Journal, Vox, and Reuters have all blocked AI crawler access. These organizations understand that protecting content from AI training doesn't mean sacrificing search visibility. Publishers should also be aware of the legal landscape around blocking AI scrapers in 2025 to ensure their approach is legally sound.
The distinction matters because search crawlers and AI training crawlers serve entirely different functions:
- Blocking Googlebot: Removes your content from Google Search results. Never do this unless you have very specific reasons.
- Blocking GPTBot: Prevents your content from training OpenAI's models. No impact on Google rankings.
- Blocking ClaudeBot: Prevents your content from training Anthropic's models. No impact on any search rankings.
- Blocking Google-Extended: Prevents content from being used for Google's generative AI training. Google explicitly states this does not affect Search rankings.
Related Content:
- Is AI Killing the Open Internet: Explore the broader implications of AI on publisher traffic and the open web ecosystem
- The Shift to AI Search: Understand how AI-powered search is changing user behavior and traffic patterns
- Future Proof Your Publishing Business: Long-term strategies to thrive as AI reshapes the digital publishing landscape
Why Publishers Should Care About Traffic Protection
Here's where ad monetization enters the picture. Every pageview represents potential revenue through display ads, video units, and direct campaigns. When AI systems answer user questions without sending visitors to your site, those are ad impressions you'll never earn. Publishers looking to maximize the traffic they do retain should explore everything there is to know about video ads for web and app publishers.
AI Overviews in Google Search have accelerated this trend. When users get an AI-generated summary at the top of search results, click-through rates drop from 15% to just 8%, a roughly 47% reduction in clicks.
The traffic erosion has been particularly devastating for news publishers and established content brands. Analysis shows that 37 of the top 50 U.S. news websites experienced year-over-year traffic declines in May 2025, with only 13 showing growth.
Publishers who depend on search traffic for ad revenue face a challenging calculation:
- Fewer site visits: AI summaries reduce the need to visit original sources.
- Lower ad impressions: No visit means no ads served.
Protecting your content from being used to train the very systems that reduce your traffic isn't paranoia. It's business sense.
What AI Crawlers Should Publishers Block?
The practical implementation of AI crawler blocking requires understanding which bots to block and which to leave alone. Here's a comprehensive approach, followed by a combined robots.txt example:
Crawlers to ALLOW (essential for search visibility):
- Googlebot: Google Search indexing
- Bingbot: Bing Search indexing
- Applebot: Apple Search and Siri (standard version)
- DuckDuckBot: DuckDuckGo indexing
Crawlers to BLOCK (AI training purposes):
- GPTBot: OpenAI model training
- ChatGPT-User: OpenAI live retrieval (optional, based on your preference)
- ClaudeBot: Anthropic model training
- Google-Extended: Google AI training specifically
- CCBot: Common Crawl data collection
- Meta-ExternalAgent: Meta AI training
- Bytespider: ByteDance data collection
- Applebot-Extended: Apple AI training
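Pulling the two lists together, a robots.txt along these lines keeps search crawlers welcome while turning away the training bots. Treat it as a starting point rather than an exhaustive registry, and trim or extend the user-agents based on what actually appears in your server logs:
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Applebot
Allow: /
User-agent: DuckDuckBot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /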
| Implementation Method | Ease of Use | Effectiveness | Considerations |
|---|---|---|---|
| robots.txt directives | Easy | Good for compliant bots | Voluntary compliance only |
| Server-level blocking | Moderate | High for all bots | Requires technical access |
| CDN/firewall rules | Easy | High | May require paid tier |
| WordPress plugins | Very easy | Good | Plugin-dependent |
Beyond robots.txt: Additional AI Blocking Strategies
For publishers wanting more robust protection, several additional options exist beyond basic robots.txt configuration.
- CDN-level blocking: Services like Cloudflare now offer one-click AI crawler blocking. More than 1 million customers have enabled this feature. The advantage here is that blocked requests never reach your server at all. As of July 2025, Cloudflare began blocking AI crawlers by default for new websites joining the platform.
- Meta tag signals: Some publishers add meta tags indicating their content shouldn't be used for AI training. While adoption is still evolving, tags like "noai" and "noimageai" signal your preferences to crawlers that check for them.
- Server-level user-agent blocking: If you have access to your server configuration, you can block specific user-agents at the nginx or Apache level. This provides more enforcement power than robots.txt alone; a rough sketch follows this list.
- Rate limiting: Even for crawlers you allow, implementing rate limits prevents any single bot from overwhelming your server resources. This is especially important for publishers already focused on Core Web Vitals and how loading ads affects publisher site performance.
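As a minimal sketch of the server-level approach, the nginx snippet below flags requests whose user-agent matches known AI training bots and refuses them with a 403. The map directive, $http_user_agent variable, and return 403 are standard nginx; the $is_ai_trainer variable name and the bot list are illustrative, and Apache admins can achieve the same effect with BrowserMatch or mod_rewrite rules:
# In the http block: flag requests whose user-agent matches known AI training bots
map $http_user_agent $is_ai_trainer {
    default              0;
    ~*gptbot             1;
    ~*claudebot          1;
    ~*ccbot              1;
    ~*bytespider         1;
    ~*meta-externalagent 1;
}

server {
    # ... existing server configuration ...

    # Refuse flagged requests before any content (or ads) is served
    if ($is_ai_trainer) {
        return 403;
    }
}
Because the request is rejected at the server, this works even against bots that ignore robots.txt, though bots that spoof a browser user-agent will still slip through.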
Maximizing Revenue on the Traffic You Keep
Protecting traffic is only half the equation. The visitors you successfully attract need to generate maximum value. This is where thoughtful ad monetization strategy becomes critical. Publishers future-proofing their content strategy for the AI era are finding that content quality and monetization optimization go hand in hand.
Publishers who maintain strong traffic despite AI-related industry headwinds share common characteristics. They focus on content that requires human nuance and expertise. They build direct relationships with readers who return regularly. They optimize ad layouts for both revenue and user experience.
The publishers seeing the best results treat yield optimization as an ongoing discipline rather than a set-it-and-forget-it configuration. They test ad placements, monitor viewability metrics, and adjust strategies based on real performance data. Understanding video header bidding and what publishers need to know can significantly boost revenue on remaining traffic.
High-impact ad units consistently outperform standard display in this environment. Video ads, interactive formats, and premium placements command higher CPMs because they deliver genuine engagement that advertisers value. For publishers exploring advanced targeting to maximize the value of each visitor, the no-BS guide to contextual advertising provides strategies that don't rely on third-party cookies.
Frequently Asked Questions About Blocking AI Crawlers
Will blocking GPTBot hurt my Google rankings?
No. GPTBot is OpenAI's training crawler and operates independently of Googlebot. Blocking GPTBot has no impact on your Google Search rankings or indexation. Google's technical documentation also clarifies that blocking Google-Extended doesn't impact Google Search crawling or ranking.
What is the crawl-to-referral ratio and why does it matter?
The crawl-to-referral ratio measures how many pages a platform crawls compared to how often it drives users to a website. A high ratio means heavy crawling but little referral traffic. In June 2025, Anthropic's crawl-to-referral ratio was 73,000:1, while Google Search's was 14:1. This highlights the imbalance between how much content AI systems consume and how little traffic they return.
Should I allow any AI crawlers at all?
It depends on your goals. Some publishers allow AI search crawlers like PerplexityBot or OAI-SearchBot selectively if they want visibility in AI search results. The key is distinguishing between crawlers used for training (which provide almost no traffic return) versus those used for search functionality (which may send some visitors your way).
How often should I update my AI blocking strategy?
Review your server logs and update your blocklist at least every few months. New AI crawlers appear constantly, and the landscape evolves rapidly. Regular monitoring helps you stay ahead of new bots that may be scraping your content.
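A quick way to see who is actually crawling you is to count AI user-agents in your access log. The one-liner below is a rough sketch that assumes a standard combined-format log at /var/log/nginx/access.log; adjust the path and the bot names to match your own stack:
grep -Eo 'GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|PerplexityBot' /var/log/nginx/access.log | sort | uniq -c | sort -rn
The output is a per-bot request count, which makes it easy to spot a new crawler that deserves a place on your blocklist.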
Next Steps:
- 10 SEO Tips to Increase Blog Traffic: Now that you've protected your content, optimize your SEO strategy for maximum visibility
- A Guide to Increasing Website Traffic: Comprehensive strategies to grow the traffic you're now protecting from AI scraping
Amplify Your Ad Revenue with Playwire
Traffic protection matters. So does making the most of every visitor who reaches your site.
Playwire's RAMP Platform gives publishers the technology and expertise to maximize ad revenue without sacrificing user experience. Our proprietary AI and machine learning algorithms optimize every impression, while our global direct sales team connects you with premium brand campaigns that drive significantly higher CPMs than programmatic alone.
What sets Playwire apart:
- AI-Powered Optimization: Machine learning algorithms that maximize yield on every impression, analyzing millions of data points in real time.
- Expert Support: Direct access to yield optimization professionals who obsess over your revenue as much as you do.
- Comprehensive Analytics: Real-time reporting that shows exactly how your content drives revenue, down to the individual page level.
- Premium Demand: Access to direct brand relationships that consistently deliver 10-20x higher CPMs than standard programmatic.
Ready to see how much more your traffic could earn? Apply now to learn what Playwire can do for your ad revenue.


