How to Block AI Bots with robots.txt: The Complete Publisher's Guide
December 8, 2025
Editorial Policy
All of our content is generated by subject matter experts with years of ad tech experience and structured by writers and educators for ease of use and digestibility. Learn more about our rigorous interview, content production and review process here.
Key Points
- Your robots.txt file is the first line of defense for controlling which AI crawlers can access your content, with major bots like GPTBot, ClaudeBot, and PerplexityBot respecting these directives.
- AI crawler traffic has surged dramatically in 2025, with training-related crawling now accounting for nearly 80% of all AI bot activity according to Cloudflare data.
- Google-Extended is a control token rather than a traditional bot, meaning it won't appear in your server logs even when properly configured.
- Blocking AI training bots won't impact your traditional SEO rankings, but you'll need additional enforcement methods beyond robots.txt for non-compliant crawlers.
- Regular testing and monitoring of your robots.txt configuration is essential since AI companies frequently introduce new crawler user agents.
Why Publishers Need to Care About AI Crawlers
The AI crawler landscape has transformed since early 2024. What was once a minor footnote in server logs has become a significant source of traffic and bandwidth consumption. Publishers are now facing a critical decision: let AI bots scrape content freely for model training, or take control over how their intellectual property gets used.
This decision has real implications for your monetization strategy. Your content is what drives traffic to your site. Traffic drives ad impressions. Ad impressions drive revenue. Understanding how programmatic monetization works and how to maximize your ad revenue becomes even more critical when AI crawlers threaten to siphon away potential visitors.
If your content ends up training AI models that then serve answers directly to users, those users may never visit your site. That's potential ad revenue walking out the door before it ever arrives. Research from Cloudflare shows crawl-to-refer ratios as high as 70,900:1 for some AI platforms, meaning for every visitor they send back, they're crawling nearly 71,000 pages.
Understanding AI Crawler Types and Purposes
AI crawlers aren't all created equal. Some companies operate multiple bots with different purposes, and understanding these distinctions is crucial for deciding which AI bots to block. For a deeper dive into every crawler you need to know about, check out our complete list of AI crawlers and how to block each one.
The major categories break down into three distinct groups. Training crawlers collect content to feed into model training datasets. Search crawlers index content for AI-powered search results. User-triggered agents fetch content when users specifically request it through chat interfaces.
OpenAI operates the most diverse crawler fleet. GPTBot handles bulk training data collection and saw a 305% increase in request volume between May 2024 and May 2025. OAI-SearchBot indexes content for ChatGPT search features. ChatGPT-User activates only when a human explicitly requests content through the interface.
| Company | Training Crawler | Search Crawler | User-Triggered Agent |
| --- | --- | --- | --- |
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | anthropic-ai, ClaudeBot | — | claude-web |
| Google | Google-Extended | (Uses standard Googlebot) | — |
| Perplexity | PerplexityBot | PerplexityBot | Perplexity-User |
| Meta | Meta-ExternalAgent | — | Meta-ExternalFetcher |
| Apple | Applebot-Extended | — | Applebot |
The Complete User Agent Reference for AI Bots
Here's the comprehensive list of user agent strings you'll need to block AI bots effectively. The robots.txt user agent token is what you'll add to your configuration file when you want to block AI crawlers from accessing your content.
| Company | Bot Name | User Agent Token | Purpose |
| --- | --- | --- | --- |
| OpenAI | GPTBot | GPTBot | Model training data collection |
| OpenAI | OAI-SearchBot | OAI-SearchBot | ChatGPT search indexing |
| OpenAI | ChatGPT-User | ChatGPT-User | User-initiated content fetch |
| Anthropic | ClaudeBot | ClaudeBot | Model training data collection |
| Anthropic | Claude-Web | Claude-Web | User-triggered content fetch |
| Anthropic | anthropic-ai | anthropic-ai | Bulk model training |
| Google | Google-Extended | Google-Extended | Gemini AI training data |
| Google | Google-CloudVertexBot | Google-CloudVertexBot | Vertex AI agent crawling |
| Perplexity | PerplexityBot | PerplexityBot | AI search indexing |
| Perplexity | Perplexity-User | Perplexity-User | Human-triggered visits |
| Meta | Meta-ExternalAgent | Meta-ExternalAgent | AI model training |
| Meta | Meta-ExternalFetcher | Meta-ExternalFetcher | User-initiated fetches |
| Apple | Applebot-Extended | Applebot-Extended | Apple Intelligence training |
| Amazon | Amazonbot | Amazonbot | Alexa and model training |
| Common Crawl | CCBot | CCBot | Open dataset for LLM training |
| Cohere | cohere-ai | cohere-ai | Model training |
| ByteDance | Bytespider | Bytespider | TikTok and model training |
| DuckDuckGo | DuckAssistBot | DuckAssistBot | AI search answers |
| You.com | YouBot | YouBot | AI search functionality |
Need a Primer? Read this first:
- Programmatic Monetization Guide: Understand how your content drives traffic, impressions, and revenue before protecting it
How to Block AI Bots with robots.txt
The robots.txt file lives in your website's root directory and tells crawlers which parts of your site they can access. The syntax is straightforward: specify a user agent, then define what's allowed or disallowed. If you're looking for a broader technical implementation strategy beyond just robots.txt, our guide on how to block AI from scraping your website covers server-level and application-layer methods as well.
A complete block of all major AI training crawlers looks like this:
# Block AI Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: DuckAssistBot
Disallow: /
# Allow traditional search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Sitemap reference
Sitemap: https://yoursite.com/sitemap.xml
Selective Blocking: A Nuanced Approach
Complete blocking may not align with every publisher's strategy. Some publishers want their content appearing in AI search results while preventing training data collection. For a comprehensive breakdown of which bots to allow and which to block based on your specific goals, see our guide on selective AI blocking strategies that allow beneficial bots while blocking harmful ones.
This balanced configuration allows that distinction:
# Allow AI Search Crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: ChatGPT-User
Allow: /
# Block Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
This approach lets you maintain visibility in AI-powered search results (which can drive referral traffic) while protecting your content from being absorbed into training datasets.
Common robots.txt Mistakes That Will Wreck Your Configuration
Even experienced webmasters make syntax errors that render their robots.txt ineffective. These mistakes are surprisingly common and frustratingly silent, since crawlers won't tell you they're ignoring your malformed directives.
Mistake #1: Standalone User-Agent Lines
Every User-agent line needs at least one Allow or Disallow directive following it. A user agent name by itself accomplishes nothing.
# Wrong - This does nothing
User-agent: GPTBot
# Correct
User-agent: GPTBot
Disallow: /
Mistake #2: Wrong File Location or Naming
Your robots.txt must live at the exact URL: https://yoursite.com/robots.txt. The filename is case-sensitive on many servers. ROBOTS.TXT or Robots.txt won't work.
Mistake #3: Blank Lines Within Rule Blocks
Blank lines signal the end of a rule block. Adding one between your User-agent and Disallow directive breaks the association.
# Wrong - Blank line breaks the rule
User-agent: GPTBot

Disallow: /
# Correct - No blank line within the block
User-agent: GPTBot
Disallow: /
Mistake #4: Case Sensitivity in Paths
Bots treat folder and file names as case-sensitive. If your folder is named /Blog/ but you block /blog/, the crawler ignores your directive entirely.
Mistake #5: Using Wildcards Incorrectly
An asterisk in a User-agent line applies the rule group to any crawler, but inside a Disallow path it matches any sequence of characters, and not every bot supports that path-level extension. Test your wildcard rules carefully before deploying; a short illustration follows.
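To make the difference concrete, here is a small illustrative snippet (the /drafts path is a placeholder; the path-level * and $ operators are extensions that Googlebot and most major AI crawlers honor, but the original robots.txt standard does not require):

```
# In the user agent position, * applies this group to any crawler without a more specific rule
User-agent: *
# Paths are prefix matches, so this already covers /drafts/ and /drafts-2025/
Disallow: /drafts
# Inside a path, * matches any characters and $ anchors the end of the URL
Disallow: /*.pdf$
```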
Related Content:
- The Complete List of AI Crawlers: Deep dive into every AI crawler you need to know about and their purposes
- How to Block AI from Scraping Your Website: Server-level and application-layer blocking methods beyond robots.txt
- Selective AI Blocking Strategies: Allow beneficial bots while blocking harmful ones based on your goals
- Using Cloudflare to Block AI Crawlers: Step-by-step WAF configuration for stronger AI crawler enforcement
The Google-Extended Exception You Need to Know
Google-Extended deserves special attention because it behaves differently than other crawlers. This is a control token rather than an actual bot with its own user agent string.
Google-Extended controls whether your content gets used for Gemini AI training. The actual crawling still happens through standard Googlebot user agent strings. This means you won't see Google-Extended in your server logs even when it's actively respecting your robots.txt directive.
There's a catch here. Some publishers report that blocking Google-Extended may affect their appearance in Google's "Grounding with Google Search" feature for Gemini. This could potentially impact citations to your pages in AI-generated responses.
Important Note: Google AI Overviews (the AI-generated summaries in search results) use standard Googlebot rules. If your content is accessible to Google Search, it's currently also accessible to AI Overviews. Blocking Google-Extended won't change that.
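The directive itself looks like any other robots.txt rule; the difference is entirely in what it controls:

```
# Opts your content out of Gemini model training (and possibly Gemini grounding citations).
# Crawling still happens under the standard Googlebot user agent,
# so don't expect Google-Extended entries in your access logs.
User-agent: Google-Extended
Disallow: /
```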
Testing Your Configuration Before Deployment
Never deploy a robots.txt update without testing. One wrong directive could block important pages from search engines, tanking your organic traffic and, by extension, your ad revenue. This is especially critical if you're managing a complex publisher ad tech stack with multiple monetization touchpoints.
Several reliable testing tools exist for validation:
- Google Search Console: Includes a robots.txt tester for Googlebot-specific validation
- Merkle Robots.txt Tester: Tests individual crawler behavior against specific user agents
- TechnicalSEO.com Robots.txt Tool: Uses Google's open-source parser library for accurate results
- Knowatoa AI Search Console: Tests your configuration against 24 different AI crawlers
A basic validation workflow includes these essential steps:
1. Upload to staging: test your new robots.txt in a staging environment before production deployment.
2. Verify critical pages are accessible: check that your highest-traffic pages, especially the article pages that drive ad impressions, remain reachable by search engines.
3. Confirm AI bots are blocked: use the testing tools above to verify your block directives work as intended.
4. Monitor server logs post-deployment: watch for crawlers that should be blocked but still appear.
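If you prefer to script this check yourself, Python's standard-library robots.txt parser can run the same validation. A minimal sketch, with placeholder URLs and a sample bot list you should swap for your own:

```
from urllib import robotparser

ROBOTS_URL = "https://yoursite.com/robots.txt"      # placeholder: your staging or live file
TEST_PAGE = "https://yoursite.com/sample-article/"  # placeholder: a high-traffic page worth checking

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider"]
SEARCH_BOTS = ["Googlebot", "Bingbot"]

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetch and parse the file

# Training bots should come back BLOCKED; traditional search bots should stay ALLOWED.
for agent in AI_BOTS + SEARCH_BOTS:
    status = "ALLOWED" if rp.can_fetch(agent, TEST_PAGE) else "BLOCKED"
    print(f"{agent:15} {status}")
```

If Googlebot or Bingbot ever prints BLOCKED, fix the file before it reaches production.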
When robots.txt Isn't Enough to Block AI Bots
Here's the uncomfortable truth: robots.txt is a gentleman's agreement. Legitimate companies like OpenAI, Anthropic, and Google honor these directives. Less reputable crawlers, or those operated by companies that prioritize data collection over ethics, may ignore your file entirely. According to analysis, robots.txt stops approximately 40-60% of AI bots, with user-agent blocking catching another 30-40%.
For publishers who need stronger enforcement, additional layers of protection exist (and if you're using Cloudflare, our step-by-step guide to blocking AI crawlers with Cloudflare's WAF configuration walks you through that setup):
- Cloudflare Bot Management: Provides AI-specific blocking through their WAF with an "AI scrapers and crawlers" rule category
- Server-level blocking via .htaccess: Blocks requests based on user agent strings before they reach your application (see the sketch after this list)
- IP range blocking: Major AI companies publish their crawler IP ranges for verification
- Fail2ban configuration: Automatically bans IPs that match crawler patterns after repeated requests
These methods require more technical expertise to implement correctly. Misconfiguration can accidentally block legitimate users or search engines.
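To illustrate the .htaccess route, here is a minimal sketch using Apache's mod_rewrite; the user agent list is an example that should mirror your robots.txt, and a rule like this belongs on staging first, since a typo can serve 403s to real visitors:

```
# Return 403 Forbidden to requests whose User-Agent matches known AI crawlers
# (requires mod_rewrite; adjust the token list to match your robots.txt)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot) [NC]
RewriteRule .* - [F,L]
```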
Monitoring What's Actually Hitting Your Site
Your server logs tell the real story of what's crawling your site. Regular monitoring helps you identify new AI crawlers before they consume significant bandwidth and gives you data to make informed blocking decisions.
A simple grep command reveals AI crawler activity (Google-Extended is left out because, as noted above, it never appears in server logs):
grep -Ei "gptbot|claudebot|perplexitybot|ccbot|bytespider|meta-external" access.log | awk '{print $1,$4,$7,$12}' | head
The awk fields assume the common combined log format, printing the client IP, timestamp, request path, and the first token of the user agent string.
Publishers managing multiple sites should consider automated monitoring solutions that alert when new crawler patterns emerge. The AI crawler landscape changes rapidly, and what works today may need updates within months.
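One lightweight form of automation is a scheduled script that tallies hits per known AI user agent, so anything new or fast-growing stands out in the report. A rough sketch in Python, assuming a combined-format access.log where the user agent is the last quoted field (the file name and token list are placeholders to adapt):

```
import re
from collections import Counter

# Tokens to watch; extend this list as new crawlers appear.
KNOWN_AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot",
                   "Bytespider", "Amazonbot", "Meta-ExternalAgent"]

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the combined log format the user agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1].lower()
        for token in KNOWN_AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1

for token, hits in counts.most_common():
    print(f"{token:20} {hits}")
```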
The Revenue Implications of Your Decision
For ad-supported publishers, this decision ultimately comes back to revenue. Your content strategy and traffic patterns should inform your approach to AI crawler management. Whether you're monetizing web properties or building an app monetization strategy, protecting your content from unauthorized training is increasingly important.
Blocking training crawlers prevents your content from being absorbed into AI models that might reduce direct traffic. However, complete blocking also means your site won't appear in AI-powered search results or get cited in AI chat responses.
The most strategic approach for many publishers involves allowing AI search crawlers while blocking training bots. This maintains visibility in AI search results, which can send referral traffic, while keeping your content out of the training datasets that threaten to replace that traffic. For a comprehensive framework covering all your options, our complete publisher's guide to AI crawlers covers when to block, allow, or optimize for maximum revenue.
Whatever you decide, make it an intentional choice based on your business model rather than accepting the default of unrestricted access. Additionally, implementing proper schema markup on your website can help search engines and AI systems understand your content's structure and attribution requirements.
Frequently Asked Questions
Does blocking AI bots hurt my SEO rankings?
No. Blocking AI training crawlers like GPTBot, ClaudeBot, and CCBot does not affect your Google or Bing search rankings. Traditional search engines use different crawlers (Googlebot, Bingbot) that operate independently. Only block those if you want to disappear from search results entirely.
Which AI bots actually respect robots.txt?
Major crawlers from OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and Perplexity (PerplexityBot) officially state they respect robots.txt directives. However, smaller or less transparent bots may ignore your configuration, which is why layered protection strategies exist.
Should I block all AI crawlers or just training bots?
It depends on your strategy. Blocking only training crawlers (GPTBot, ClaudeBot, CCBot) protects your content from model training while allowing search-focused crawlers to help you appear in AI search results. Complete blocking removes you from AI ecosystems entirely.
How often do I need to update my robots.txt for new AI bots?
Review your configuration quarterly at minimum. AI companies regularly introduce new crawlers. Anthropic merged their "anthropic-ai" and "Claude-Web" bots into "ClaudeBot," giving the new bot temporary unrestricted access to sites that hadn't updated their rules.
Next Steps:
- Complete AI Blocking Hub: Comprehensive framework for when to block, allow, or optimize for maximum revenue
- Schema Markup for SEO: Help search engines understand your content structure and attribution requirements
Amplify Your Revenue on the Traffic You Keep
Whether you block AI crawlers entirely or take a nuanced approach, one thing remains constant: the traffic that reaches your site needs to generate maximum revenue. That's where your monetization strategy becomes critical. Understanding how header bidding maximizes competition for your ad inventory is essential for extracting full value from every impression.
Playwire's RAMP Platform helps publishers extract maximum value from every session. Our Revenue Intelligence algorithm optimizes ad placements in real time, ensuring you're earning as much as possible from the traffic that makes it through to your site. With AI potentially reducing direct traffic over time, making every pageview count has never been more important.
Ready to make your existing traffic work harder? Contact us to learn how the RAMP Platform can amplify your ad revenue.


