How to Block AI Bots with robots.txt: The Complete Publisher's Guide
December 8, 2025
Editorial Policy
All of our content is generated by subject matter experts with years of ad tech experience and structured by writers and educators for ease of use and digestibility. Learn more about our rigorous interview, content production and review process here.
Key Points
- Your robots.txt file is the first line of defense for controlling which AI crawlers can access your content, with major bots like GPTBot, ClaudeBot, and PerplexityBot respecting these directives.
- AI crawler traffic has surged dramatically in 2025, with training-related crawling now accounting for nearly 80% of all AI bot activity according to Cloudflare data.
- Google-Extended is a control token rather than a traditional bot, meaning it won't appear in your server logs even when properly configured.
- Blocking AI training bots won't impact your traditional SEO rankings, but you'll need additional enforcement methods beyond robots.txt for non-compliant crawlers.
- Regular testing and monitoring of your robots.txt configuration is essential since AI companies frequently introduce new crawler user agents.
Why Publishers Need to Care About AI Crawlers
The AI crawler landscape has transformed since early 2024. What was once a minor footnote in server logs has become a significant source of traffic and bandwidth consumption. Publishers are now facing a critical decision: let AI bots scrape content freely for model training, or take control over how their intellectual property gets used.
This decision has real implications for your monetization strategy. Your content is what drives traffic to your site. Traffic drives ad impressions. Ad impressions drive revenue. Understanding how programmatic monetization works and how to maximize your ad revenue becomes even more critical when AI crawlers threaten to siphon away potential visitors.
If your content ends up training AI models that then serve answers directly to users, those users may never visit your site. That's potential ad revenue walking out the door before it ever arrives. Research from Cloudflare shows crawl-to-refer ratios as high as 70,900:1 for some AI platforms, meaning for every visitor they send back, they're crawling nearly 71,000 pages.
Understanding AI Crawler Types and Purposes
AI crawlers aren't all created equal. Some companies operate multiple bots with different purposes, and understanding these distinctions is crucial for deciding which AI bots to block. For a deeper dive into every crawler you need to know about, check out our complete list of AI crawlers and how to block each one.
The major categories break down into three distinct groups. Training crawlers collect content to feed into model training datasets. Search crawlers index content for AI-powered search results. User-triggered agents fetch content when users specifically request it through chat interfaces.
OpenAI operates the most diverse crawler fleet. GPTBot handles bulk training data collection and saw a 305% increase in request volume between May 2024 and May 2025. OAI-SearchBot indexes content for ChatGPT search features. ChatGPT-User activates only when a human explicitly requests content through the interface.
| Company | Training Crawler | Search Crawler | User-Triggered Agent |
| --- | --- | --- | --- |
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | anthropic-ai, ClaudeBot | — | claude-web |
| Google | Google-Extended | (Uses standard Googlebot) | — |
| Perplexity | PerplexityBot | PerplexityBot | Perplexity-User |
| Meta | Meta-ExternalAgent | — | Meta-ExternalFetcher |
| Apple | Applebot-Extended | — | Applebot |
The Complete User Agent Reference for AI Bots
Here's the comprehensive list of user agent strings you'll need to block AI bots effectively. The robots.txt user agent token is what you'll add to your configuration file when you want to block AI crawlers from accessing your content.
| Company | Bot Name | User Agent Token | Purpose |
| --- | --- | --- | --- |
| OpenAI | GPTBot | GPTBot | Model training data collection |
| OpenAI | OAI-SearchBot | OAI-SearchBot | ChatGPT search indexing |
| OpenAI | ChatGPT-User | ChatGPT-User | User-initiated content fetch |
| Anthropic | ClaudeBot | ClaudeBot | Model training data collection |
| Anthropic | Claude-Web | Claude-Web | User-triggered content fetch |
| Anthropic | anthropic-ai | anthropic-ai | Bulk model training |
| Google | Google-Extended | Google-Extended | Gemini AI training data |
| Google | Google-CloudVertexBot | Google-CloudVertexBot | Vertex AI agent crawling |
| Perplexity | PerplexityBot | PerplexityBot | AI search indexing |
| Perplexity | Perplexity-User | Perplexity-User | Human-triggered visits |
| Meta | Meta-ExternalAgent | Meta-ExternalAgent | AI model training |
| Meta | Meta-ExternalFetcher | Meta-ExternalFetcher | User-initiated fetches |
| Apple | Applebot-Extended | Applebot-Extended | Apple Intelligence training |
| Amazon | Amazonbot | Amazonbot | Alexa and model training |
| Common Crawl | CCBot | CCBot | Open dataset for LLM training |
| Cohere | cohere-ai | cohere-ai | Model training |
| ByteDance | Bytespider | Bytespider | TikTok and model training |
| DuckDuckGo | DuckAssistBot | DuckAssistBot | AI search answers |
| You.com | YouBot | YouBot | AI search functionality |
Need a Primer? Read this first:
- Programmatic Monetization Guide: Understand how your content drives traffic, impressions, and revenue before protecting it
How to Block AI Bots with robots.txt
The robots.txt file lives in your website's root directory and tells crawlers which parts of your site they can access. The syntax is straightforward: specify a user agent, then define what's allowed or disallowed. If you're looking for a broader technical implementation strategy beyond just robots.txt, our guide on how to block AI from scraping your website covers server-level and application-layer methods as well.
A complete block of all major AI training crawlers looks like this:
# Block AI Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: DuckAssistBot
Disallow: /
# Allow traditional search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Sitemap reference
Sitemap: https://yoursite.com/sitemap.xml
Selective Blocking: A Nuanced Approach
Complete blocking may not align with every publisher's strategy. Some publishers want their content appearing in AI search results while preventing training data collection. For a comprehensive breakdown of which bots to allow and which to block based on your specific goals, see our guide on selective AI blocking strategies that allow beneficial bots while blocking harmful ones.
This balanced configuration allows that distinction:
# Allow AI Search Crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: ChatGPT-User
Allow: /
# Block Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
This approach lets you maintain visibility in AI-powered search results (which can drive referral traffic) while protecting your content from being absorbed into training datasets.
Common robots.txt Mistakes That Will Wreck Your Configuration
Even experienced webmasters make syntax errors that render their robots.txt ineffective. These mistakes are surprisingly common and frustratingly silent, since crawlers won't tell you they're ignoring your malformed directives.
Mistake #1: Standalone User-Agent Lines
Every User-agent line needs at least one Allow or Disallow directive following it. A user agent name by itself accomplishes nothing.
# Wrong - This does nothing
User-agent: GPTBot
# Correct
User-agent: GPTBot
Disallow: /
Mistake #2: Wrong File Location or Naming
Your robots.txt must live at the exact URL: https://yoursite.com/robots.txt. The filename is case-sensitive on many servers. ROBOTS.TXT or Robots.txt won't work.
Mistake #3: Blank Lines Within Rule Blocks
Blank lines signal the end of a rule block. Adding one between your User-agent and Disallow directive breaks the association.
# Wrong - Blank line breaks the rule
User-agent: GPTBot

Disallow: /
# Correct - No blank line within the block
User-agent: GPTBot
Disallow: /
Mistake #4: Case Sensitivity in Paths
Bots treat folder and file names as case-sensitive. If your folder is named /Blog/ but you block /blog/, the crawler ignores your directive entirely.
Mistake #5: Using Wildcards Incorrectly
An asterisk in a User-agent line applies the rule group to any crawler, but inside a Disallow path it matches any sequence of characters, and not every bot supports that path-level extension. Test your wildcard rules carefully before deploying; a short illustration follows.
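To make the difference concrete, here is a small illustrative snippet (the /drafts path is a placeholder; the path-level * and $ operators are extensions that Googlebot and most major AI crawlers honor, but the original robots.txt standard does not require):

```
# In the user agent position, * applies this group to any crawler without a more specific rule
User-agent: *
# Paths are prefix matches, so this already covers /drafts/ and /drafts-2025/
Disallow: /drafts
# Inside a path, * matches any characters and $ anchors the end of the URL
Disallow: /*.pdf$
```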
Related Content:
- The Complete List of AI Crawlers: Deep dive into every AI crawler you need to know about and their purposes
- How to Block AI from Scraping Your Website: Server-level and application-layer blocking methods beyond robots.txt
- Selective AI Blocking Strategies: Allow beneficial bots while blocking harmful ones based on your goals
- Using Cloudflare to Block AI Crawlers: Step-by-step WAF configuration for stronger AI crawler enforcement
The Google-Extended Exception You Need to Know
Google-Extended deserves special attention because it behaves differently than other crawlers. This is a control token rather than an actual bot with its own user agent string.
Google-Extended controls whether your content gets used for Gemini AI training. The actual crawling still happens through standard Googlebot user agent strings. This means you won't see Google-Extended in your server logs even when it's actively respecting your robots.txt directive.
There's a catch here. Some publishers report that blocking Google-Extended may affect their appearance in Google's "Grounding with Google Search" feature for Gemini. This could potentially impact citations to your pages in AI-generated responses.
Important Note: Google AI Overviews (the AI-generated summaries in search results) use standard Googlebot rules. If your content is accessible to Google Search, it's currently also accessible to AI Overviews. Blocking Google-Extended won't change that.
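The directive itself looks like any other robots.txt rule; the difference is entirely in what it controls:

```
# Opts your content out of Gemini model training (and possibly Gemini grounding citations).
# Crawling still happens under the standard Googlebot user agent,
# so don't expect Google-Extended entries in your access logs.
User-agent: Google-Extended
Disallow: /
```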
Testing Your Configuration Before Deployment
Never deploy a robots.txt update without testing. One wrong directive could block important pages from search engines, tanking your organic traffic and, by extension, your ad revenue. This is especially critical if you're managing a complex publisher ad tech stack with multiple monetization touchpoints.
Several reliable testing tools exist for validation:
- Google Search Console: Includes a robots.txt tester for Googlebot-specific validation
- Merkle Robots.txt Tester: Tests individual crawler behavior against specific user agents
- TechnicalSEO.com Robots.txt Tool: Uses Google's open-source parser library for accurate results
- Knowatoa AI Search Console: Tests your configuration against 24 different AI crawlers
A basic validation workflow includes these essential steps:
1. Upload to staging: test your new robots.txt in a staging environment before production deployment.
2. Verify critical pages are accessible: check that your highest-traffic pages, especially the article pages that drive ad impressions, remain reachable by search engines.
3. Confirm AI bots are blocked: use the testing tools above to verify your block directives work as intended.
4. Monitor server logs post-deployment: watch for crawlers that should be blocked but still appear.
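If you prefer to script this check yourself, Python's standard-library robots.txt parser can run the same validation. A minimal sketch, with placeholder URLs and a sample bot list you should swap for your own:

```
from urllib import robotparser

ROBOTS_URL = "https://yoursite.com/robots.txt"      # placeholder: your staging or live file
TEST_PAGE = "https://yoursite.com/sample-article/"  # placeholder: a high-traffic page worth checking

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider"]
SEARCH_BOTS = ["Googlebot", "Bingbot"]

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetch and parse the file

# Training bots should come back BLOCKED; traditional search bots should stay ALLOWED.
for agent in AI_BOTS + SEARCH_BOTS:
    status = "ALLOWED" if rp.can_fetch(agent, TEST_PAGE) else "BLOCKED"
    print(f"{agent:15} {status}")
```

If Googlebot or Bingbot ever prints BLOCKED, fix the file before it reaches production.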
When robots.txt Isn't Enough to Block AI Bots
Here's the uncomfortable truth: robots.txt is a gentleman's agreement. Legitimate companies like OpenAI, Anthropic, and Google honor these directives. Less reputable crawlers, or those operated by companies that prioritize data collection over ethics, may ignore your file entirely. According to analysis, robots.txt stops approximately 40-60% of AI bots, with user-agent blocking catching another 30-40%.
For publishers who need stronger enforcement, additional layers of protection exist (and if you're using Cloudflare, our step-by-step guide to blocking AI crawlers with Cloudflare's WAF configuration walks you through that setup):
- Cloudflare Bot Management: Provides AI-specific blocking through their WAF with an "AI scrapers and crawlers" rule category
- Server-level blocking via .htaccess: Blocks requests based on user agent strings before they reach your application (see the sketch after this list)
- IP range blocking: Major AI companies publish their crawler IP ranges for verification
- Fail2ban configuration: Automatically bans IPs that match crawler patterns after repeated requests
These methods require more technical expertise to implement correctly. Misconfiguration can accidentally block legitimate users or search engines.
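To illustrate the .htaccess route, here is a minimal sketch using Apache's mod_rewrite; the user agent list is an example that should mirror your robots.txt, and a rule like this belongs on staging first, since a typo can serve 403s to real visitors:

```
# Return 403 Forbidden to requests whose User-Agent matches known AI crawlers
# (requires mod_rewrite; adjust the token list to match your robots.txt)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot) [NC]
RewriteRule .* - [F,L]
```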
Monitoring What's Actually Hitting Your Site
Your server logs tell the real story of what's crawling your site. Regular monitoring helps you identify new AI crawlers before they consume significant bandwidth and gives you data to make informed blocking decisions.
A simple grep command reveals AI crawler activity (Google-Extended is left out because, as noted above, it never appears in server logs):
grep -Ei "gptbot|claudebot|perplexitybot|ccbot|bytespider|meta-external" access.log | awk '{print $1,$4,$7,$12}' | head
The awk fields assume the common combined log format, printing the client IP, timestamp, request path, and the first token of the user agent string.
Publishers managing multiple sites should consider automated monitoring solutions that alert when new crawler patterns emerge. The AI crawler landscape changes rapidly, and what works today may need updates within months.
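One lightweight form of automation is a scheduled script that tallies hits per known AI user agent, so anything new or fast-growing stands out in the report. A rough sketch in Python, assuming a combined-format access.log where the user agent is the last quoted field (the file name and token list are placeholders to adapt):

```
import re
from collections import Counter

# Tokens to watch; extend this list as new crawlers appear.
KNOWN_AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot",
                   "Bytespider", "Amazonbot", "Meta-ExternalAgent"]

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the combined log format the user agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1].lower()
        for token in KNOWN_AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1

for token, hits in counts.most_common():
    print(f"{token:20} {hits}")
```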
The Revenue Implications of Your Decision
For ad-supported publishers, this decision ultimately comes back to revenue. Your content strategy and traffic patterns should inform your approach to AI crawler management. Whether you're monetizing web properties or building an app monetization strategy, protecting your content from unauthorized training is increasingly important.
Blocking training crawlers prevents your content from being absorbed into AI models that might reduce direct traffic. However, complete blocking also means your site won't appear in AI-powered search results or get cited in AI chat responses.
The most strategic approach for many publishers involves allowing AI search crawlers while blocking training bots. This maintains visibility in AI search results, which can send referral traffic, while keeping your content out of the training datasets that threaten to replace that traffic. For a comprehensive framework covering all your options, our complete publisher's guide to AI crawlers covers when to block, allow, or optimize for maximum revenue.
Whatever you decide, make it an intentional choice based on your business model rather than accepting the default of unrestricted access. Additionally, implementing proper schema markup on your website can help search engines and AI systems understand your content's structure and attribution requirements.
Frequently Asked Questions
Does blocking AI bots hurt my SEO rankings?
No. Blocking AI training crawlers like GPTBot, ClaudeBot, and CCBot does not affect your Google or Bing search rankings. Traditional search engines use different crawlers (Googlebot, Bingbot) that operate independently. Only block those if you want to disappear from search results entirely.
Which AI bots actually respect robots.txt?
Major crawlers from OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and Perplexity (PerplexityBot) officially state they respect robots.txt directives. However, smaller or less transparent bots may ignore your configuration, which is why layered protection strategies exist.
Should I block all AI crawlers or just training bots?
It depends on your strategy. Blocking only training crawlers (GPTBot, ClaudeBot, CCBot) protects your content from model training while allowing search-focused crawlers to help you appear in AI search results. Complete blocking removes you from AI ecosystems entirely.
How often do I need to update my robots.txt for new AI bots?
Review your configuration quarterly at minimum. AI companies regularly introduce new crawlers. Anthropic merged their "anthropic-ai" and "Claude-Web" bots into "ClaudeBot," giving the new bot temporary unrestricted access to sites that hadn't updated their rules.
Next Steps:
- Complete AI Blocking Hub: Comprehensive framework for when to block, allow, or optimize for maximum revenue
- Schema Markup for SEO: Help search engines understand your content structure and attribution requirements
Amplify Your Revenue on the Traffic You Keep
Whether you block AI crawlers entirely or take a nuanced approach, one thing remains constant: the traffic that reaches your site needs to generate maximum revenue. That's where your monetization strategy becomes critical. Understanding how header bidding maximizes competition for your ad inventory is essential for extracting full value from every impression.
Playwire's RAMP Platform helps publishers extract maximum value from every session. Our Revenue Intelligence algorithm optimizes ad placements in real time, ensuring you're earning as much as possible from the traffic that makes it through to your site. With AI potentially reducing direct traffic over time, making every pageview count has never been more important.
Ready to make your existing traffic work harder? Contact us to learn how the RAMP Platform can amplify your ad revenue.


