AI Scraping vs. Traditional SEO Crawling: What Publishers Need to Know About Blocking AI
December 8, 2025
Key Points
- Search engine crawlers and AI training crawlers serve fundamentally different purposes: Traditional bots like Googlebot index your content to drive traffic back to your site, while AI crawlers like GPTBot extract your content to train language models.
- Blocking AI crawlers does not harm your SEO: Data from major publisher networks shows no statistically significant traffic changes when publishers block AI training bots while keeping search crawlers enabled.
- Selectively blocking AI bots is a strategic advantage: You can maintain full search visibility while protecting your content from being used to train AI systems that may never send visitors your way.
- The robots.txt file remains your primary tool: Properly configured directives let you allow Googlebot while blocking GPTBot, ClaudeBot, and other AI training crawlers.
- Traffic protection directly impacts ad revenue: Every pageview protected from AI-driven zero-click searches represents preserved ad impressions and monetization opportunities.
The Great Bot Divide: Why Blocking AI Matters for Publishers
Here's a truth that keeps publishers up at night: the bots crawling your site today aren't playing by the same rules. Search engine crawlers and AI training crawlers look similar in your server logs, but their intentions couldn't be more different. Understanding this distinction is the difference between protecting your traffic and accidentally tanking your SEO.
The internet has operated on a simple exchange for decades. Search engines index your content and direct users back to your website, generating traffic and ad revenue. AI crawlers have fundamentally broken this social contract.
They extract your content to make their systems smarter, often without sending a single visitor your way. For publishers navigating this new reality, understanding how AI traffic is reshaping SEO and how to optimize for AI referrals has become essential.
Recent data paints a stark picture, and training is the clear leader: over the past 12 months, 80% of AI crawling was for training, compared with 18% for search and just 2% for user actions. Meanwhile, the crawl-to-referral ratio for some AI companies reaches staggering levels, with Anthropic's Claude showing ratios from 50,000:1 to 70,900:1.
Need a Primer? Read this first:
- AI Traffic is the New SEO: Understand how AI is reshaping SEO and what optimization for AI referrals looks like
- Is AI Killing the Open Internet: Get the big-picture context on how AI is impacting publisher traffic and revenue
- The Shift to AI Search: Learn how AI-powered search is fundamentally changing user behavior and click-through rates
How Search Engine Crawlers Actually Work
Traditional search crawlers like Googlebot have been the backbone of web discovery since the internet's early days. These bots systematically navigate websites by following links, indexing content, and storing information in massive databases that power search results.
The fundamental purpose of a search crawler is retrieval. Googlebot wants to understand your content so it can surface your pages when users search for relevant topics. This creates a mutually beneficial relationship where quality content earns visibility, which drives traffic, which generates revenue. Understanding what publishers and advertisers need to know about programmatic advertising helps contextualize why this traffic matters so much for monetization.
Search crawlers typically follow these behaviors:
- Respectful crawling patterns: They honor robots.txt directives and crawl-delay requests.
- Indexing for retrieval: Content is stored to be matched against search queries.
- Traffic generation: The end goal is directing users to your website.
- Transparent identification: Legitimate search bots identify themselves clearly in user-agent strings.
AI Training Crawlers: A Different Beast Entirely
AI training crawlers represent a paradigm shift in how content is consumed on the web. Bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and Meta-ExternalAgent don't index your content for search results. They extract information to train large language models that power AI assistants and chatbots.
The economics here work entirely in the AI company's favor. Your carefully crafted content becomes training data that helps AI systems generate answers directly to users, often eliminating any reason for those users to visit your site.
It's your expertise, powering their product, without compensation or traffic in return. Publishers weighing their options should understand the real cost of blocking AI crawlers on traffic and revenue before making decisions.
The AI crawler landscape has exploded in recent months. GPTBot's share grew from 2.2% to 7.7% of all crawler traffic, with a 305% rise in requests between May 2024 and May 2025. Meanwhile, some AI crawlers have earned reputations for aggressive behavior, with certain bots making enormous volumes of requests that strain server infrastructure.
| Crawler Type | Primary Purpose | Traffic Benefit | Content Usage |
|---|---|---|---|
| Googlebot | Search indexing | High (drives clicks) | Surfaces pages in SERPs |
| Bingbot | Search indexing | Moderate | Surfaces pages in Bing |
| GPTBot | LLM training | Minimal to none | Trains ChatGPT models |
| ClaudeBot | LLM training | Minimal to none | Trains Claude models |
| Meta-ExternalAgent | AI model training | Minimal to none | Powers Meta AI products |
| PerplexityBot | AI search/retrieval | Low | Generates AI answers |
How Does Blocking AI Crawlers Work with robots.txt?
Your robots.txt file is the primary tool for controlling crawler access to your site. This simple text file in your website's root directory tells bots which areas they can and cannot access. The catch? Compliance is voluntary.
Well-behaved crawlers from major companies generally honor robots.txt directives. Google, Microsoft, OpenAI, and Anthropic have all publicly stated that their crawlers respect these rules. However, some less scrupulous bots ignore these guidelines entirely, scraping whatever they want regardless of your preferences. For a deeper dive into enforcement options, our complete publisher's guide to AI crawlers covers how to block, allow, or optimize for maximum revenue.
Setting up selective blocking is straightforward. You can allow search engine crawlers while blocking AI training bots using targeted directives:
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
The key insight here is understanding which bots serve which purposes. Google-Extended, for example, is a separate product token that controls whether your content is used for Google's AI training; it is distinct from Googlebot, so blocking it won't affect your Google Search visibility.
Does Blocking AI Crawlers Hurt Your SEO?
This is the million-dollar question, and the data provides a clear answer: no.
Research tracking thousands of publisher sites over a year-long period found no statistically significant traffic changes when sites blocked AI crawlers. The average traffic variation between sites that blocked AI bots and those that didn't remained within 1%, which falls within normal fluctuation ranges.
Major publishers have already made this call. The New York Times, The Wall Street Journal, Vox, and Reuters have all blocked AI crawler access. These organizations understand that protecting content from AI training doesn't mean sacrificing search visibility. Publishers should also be aware of the legal landscape around blocking AI scrapers in 2025 to ensure their approach is legally sound.
The distinction matters because search crawlers and AI training crawlers serve entirely different functions:
- Blocking Googlebot: Removes your content from Google Search results. Never do this unless you have very specific reasons.
- Blocking GPTBot: Prevents your content from training OpenAI's models. No impact on Google rankings.
- Blocking ClaudeBot: Prevents your content from training Anthropic's models. No impact on any search rankings.
- Blocking Google-Extended: Prevents content from being used for Google's generative AI training. Google explicitly states this does not affect Search rankings.
Related Content:
- Is AI Killing the Open Internet: Explore the broader implications of AI on publisher traffic and the open web ecosystem
- The Shift to AI Search: Understand how AI-powered search is changing user behavior and traffic patterns
- Future Proof Your Publishing Business: Long-term strategies to thrive as AI reshapes the digital publishing landscape
Why Publishers Should Care About Traffic Protection
Here's where ad monetization enters the picture. Every pageview represents potential revenue through display ads, video units, and direct campaigns. When AI systems answer user questions without sending visitors to your site, those are ad impressions you'll never earn. Publishers looking to maximize the traffic they do retain should explore everything there is to know about video ads for web and app publishers.
AI Overviews in Google Search have accelerated this trend. When users get an AI-generated summary at the top of search results, click-through rates drop from 15% to just 8%, a roughly 47% reduction in clicks.
The traffic erosion has been particularly devastating for news publishers and established content brands. Analysis shows that 37 of the top 50 U.S. news websites experienced year-over-year traffic declines in May 2025, with only 13 showing growth.
Publishers who depend on search traffic for ad revenue face a challenging calculation:
- Fewer site visits: AI summaries reduce the need to visit original sources.
- Lower ad impressions: No visit means no ads served.
Protecting your content from being used to train the very systems that reduce your traffic isn't paranoia. It's business sense.
What AI Crawlers Should Publishers Block?
The practical implementation of AI crawler blocking requires understanding which bots to block and which to leave alone. Here's a comprehensive approach, followed by a combined robots.txt example:
Crawlers to ALLOW (essential for search visibility):
- Googlebot: Google Search indexing
- Bingbot: Bing Search indexing
- Applebot: Apple Search and Siri (standard version)
- DuckDuckBot: DuckDuckGo indexing
Crawlers to BLOCK (AI training purposes):
- GPTBot: OpenAI model training
- ChatGPT-User: OpenAI live retrieval (optional, based on your preference)
- ClaudeBot: Anthropic model training
- Google-Extended: Google AI training specifically
- CCBot: Common Crawl data collection
- Meta-ExternalAgent: Meta AI training
- Bytespider: ByteDance data collection
- Applebot-Extended: Apple AI training
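Pulling the two lists together, a robots.txt along these lines keeps search crawlers welcome while turning away the training bots. Treat it as a starting point rather than an exhaustive registry, and trim or extend the user-agents based on what actually appears in your server logs:
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Applebot
Allow: /
User-agent: DuckDuckBot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /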
| Implementation Method | Ease of Use | Effectiveness | Considerations |
|---|---|---|---|
| robots.txt directives | Easy | Good for compliant bots | Voluntary compliance only |
| Server-level blocking | Moderate | High for all bots | Requires technical access |
| CDN/firewall rules | Easy | High | May require paid tier |
| WordPress plugins | Very easy | Good | Plugin-dependent |
Beyond robots.txt: Additional AI Blocking Strategies
For publishers wanting more robust protection, several additional options exist beyond basic robots.txt configuration.
- CDN-level blocking: Services like Cloudflare now offer one-click AI crawler blocking. More than 1 million customers have enabled this feature. The advantage here is that blocked requests never reach your server at all. As of July 2025, Cloudflare began blocking AI crawlers by default for new websites joining the platform.
- Meta tag signals: Some publishers add meta tags indicating their content shouldn't be used for AI training. While adoption is still evolving, tags like "noai" and "noimageai" signal your preferences to crawlers that check for them.
- Server-level user-agent blocking: If you have access to your server configuration, you can block specific user-agents at the nginx or Apache level. This provides more enforcement power than robots.txt alone; a rough sketch follows this list.
- Rate limiting: Even for crawlers you allow, implementing rate limits prevents any single bot from overwhelming your server resources. This is especially important for publishers already focused on Core Web Vitals and how loading ads affects publisher site performance.
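As a minimal sketch of the server-level approach, the nginx snippet below flags requests whose user-agent matches known AI training bots and refuses them with a 403. The map directive, $http_user_agent variable, and return 403 are standard nginx; the $is_ai_trainer variable name and the bot list are illustrative, and Apache admins can achieve the same effect with BrowserMatch or mod_rewrite rules:
# In the http block: flag requests whose user-agent matches known AI training bots
map $http_user_agent $is_ai_trainer {
    default              0;
    ~*gptbot             1;
    ~*claudebot          1;
    ~*ccbot              1;
    ~*bytespider         1;
    ~*meta-externalagent 1;
}

server {
    # ... existing server configuration ...

    # Refuse flagged requests before any content (or ads) is served
    if ($is_ai_trainer) {
        return 403;
    }
}
Because the request is rejected at the server, this works even against bots that ignore robots.txt, though bots that spoof a browser user-agent will still slip through.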
Maximizing Revenue on the Traffic You Keep
Protecting traffic is only half the equation. The visitors you successfully attract need to generate maximum value. This is where thoughtful ad monetization strategy becomes critical. Publishers future-proofing their content strategy for the AI era are finding that content quality and monetization optimization go hand in hand.
Publishers who maintain strong traffic despite AI-related industry headwinds share common characteristics. They focus on content that requires human nuance and expertise. They build direct relationships with readers who return regularly. They optimize ad layouts for both revenue and user experience.
The publishers seeing the best results treat yield optimization as an ongoing discipline rather than a set-it-and-forget-it configuration. They test ad placements, monitor viewability metrics, and adjust strategies based on real performance data. Understanding video header bidding and what publishers need to know can significantly boost revenue on remaining traffic.
High-impact ad units consistently outperform standard display in this environment. Video ads, interactive formats, and premium placements command higher CPMs because they deliver genuine engagement that advertisers value. For publishers exploring advanced targeting to maximize the value of each visitor, the no-BS guide to contextual advertising provides strategies that don't rely on third-party cookies.
Frequently Asked Questions About Blocking AI Crawlers
Will blocking GPTBot hurt my Google rankings?
No. GPTBot is OpenAI's training crawler and operates independently of Googlebot. Blocking GPTBot has no impact on your Google Search rankings or indexation. Google's technical documentation also clarifies that blocking Google-Extended doesn't impact Google Search crawling or ranking.
What is the crawl-to-referral ratio and why does it matter?
The crawl-to-referral ratio measures how many pages a platform crawls compared to how often it drives users to a website. A high ratio means heavy crawling but little referral traffic. In June 2025, Anthropic's crawl-to-referral ratio was 73,000:1, while Google Search's was 14:1. This highlights the imbalance between how much content AI systems consume and how little traffic they return.
Should I allow any AI crawlers at all?
It depends on your goals. Some publishers allow AI search crawlers like PerplexityBot or OAI-SearchBot selectively if they want visibility in AI search results. The key is distinguishing between crawlers used for training (which provide almost no traffic return) versus those used for search functionality (which may send some visitors your way).
How often should I update my AI blocking strategy?
Review your server logs and update your blocklist at least every few months. New AI crawlers appear constantly, and the landscape evolves rapidly. Regular monitoring helps you stay ahead of new bots that may be scraping your content.
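A quick way to see who is actually crawling you is to count AI user-agents in your access log. The one-liner below is a rough sketch that assumes a standard combined-format log at /var/log/nginx/access.log; adjust the path and the bot names to match your own stack:
grep -Eo 'GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|PerplexityBot' /var/log/nginx/access.log | sort | uniq -c | sort -rn
The output is a per-bot request count, which makes it easy to spot a new crawler that deserves a place on your blocklist.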
Next Steps:
- 10 SEO Tips to Increase Blog Traffic: Now that you've protected your content, optimize your SEO strategy for maximum visibility
- A Guide to Increasing Website Traffic: Comprehensive strategies to grow the traffic you're now protecting from AI scraping
Amplify Your Ad Revenue with Playwire
Traffic protection matters. So does making the most of every visitor who reaches your site.
Playwire's RAMP Platform gives publishers the technology and expertise to maximize ad revenue without sacrificing user experience. Our proprietary AI and machine learning algorithms optimize every impression, while our global direct sales team connects you with premium brand campaigns that drive significantly higher CPMs than programmatic alone.
What sets Playwire apart:
- AI-Powered Optimization: Machine learning algorithms that maximize yield on every impression, analyzing millions of data points in real time.
- Expert Support: Direct access to yield optimization professionals who obsess over your revenue as much as you do.
- Comprehensive Analytics: Real-time reporting that shows exactly how your content drives revenue, down to the individual page level.
- Premium Demand: Access to direct brand relationships that consistently deliver 10-20x higher CPMs than standard programmatic.
Ready to see how much more your traffic could earn? Apply now to learn what Playwire can do for your ad revenue.


