The Complete List of AI Crawlers and How to Block Each One
December 8, 2025
Editorial Policy
All of our content is generated by subject matter experts with years of ad tech experience and structured by writers and educators for ease of use and digestibility. Learn more about our rigorous interview, content production and review process here.
Key Points
- Training crawlers now account for nearly 80% of all AI bot traffic to websites, consuming content while sending minimal referral traffic back to publishers.
- OpenAI, Anthropic, Google, Meta, Apple, and Amazon each operate multiple crawlers with distinct purposes, from model training to real-time search functionality.
- Robots.txt blocking is the first line of defense, but verification through IP allowlisting provides stronger protection against spoofed user agents.
- Publishers must weigh the trade-offs carefully: blocking training crawlers protects content while blocking search crawlers may reduce visibility in AI-powered discovery platforms.
- This directory includes ready-to-use robots.txt snippets for every major AI crawler, organized by company and purpose.
Why Publishers Need an AI Blocker Strategy
The relationship between publishers and crawlers has fundamentally changed. Traditional search engines operated on a symbiotic model: they crawled your content, indexed it, and sent visitors your way when users searched for relevant information. AI crawlers have flipped this arrangement on its head.
Cloudflare's data reveals the stark imbalance in this new reality. For every referral Anthropic sends back to a website, its crawlers have already visited approximately 38,000 pages. OpenAI's ratio sits around 400:1. These platforms consume vast amounts of publisher content to train models and power AI-generated responses, often without users ever clicking through to the source.
The impact on publisher traffic is real and measurable. Reports from Digital Content Next indicate that AI overviews and chat-based responses are contributing to traffic declines ranging from 9% to 25% for news and content sites.
For publishers who depend on traffic-based ad revenue, understanding which crawlers are hitting your site and deciding which to block has become a critical business decision. Our complete publisher's guide to AI crawlers covers the strategic framework for deciding whether to block, allow, or optimize your approach to these bots.
Need a Primer? Read these first:
- The Publisher's Guide to AI Crawlers: Strategic framework for deciding whether to block, allow, or optimize your approach to AI bots
- How AI Crawling Affects Your Ad Revenue: Data-driven analysis of traffic and monetization impacts from AI crawler activity
Understanding AI Crawler Categories
Before diving into the comprehensive list of AI sites to block, you need to understand what these crawlers actually do. AI crawlers fall into three distinct categories, each with different implications for your content and traffic.
Training Crawlers
Training crawlers collect web content to build datasets for large language model development. This is the most aggressive category, accounting for roughly 80% of all AI crawler traffic according to Cloudflare's analysis. Once your content enters a training dataset, it becomes part of the model's knowledge base, potentially reducing users' need to visit your site for answers.
These crawlers operate with high volume and systematic crawling patterns. The content they collect is used for model improvement, and they return minimal to zero referral traffic back to publishers. Understanding how AI crawling affects your ad revenue through measurable traffic and monetization impacts helps publishers quantify what's at stake.
Search and Citation Crawlers
Search crawlers index content for AI-powered search experiences and citation purposes. When users ask questions in ChatGPT or Perplexity, these crawlers help surface relevant sources. Unlike training crawlers, search crawlers may actually send some traffic back to publishers through citations.
These operate at moderate volume with retrieval-focused behavior. They may include attribution and links, offering some referral traffic potential for publishers who remain accessible.
User-Triggered Fetchers
These crawlers activate when users specifically request content through AI assistants. When someone pastes a URL into ChatGPT or asks Perplexity to analyze a specific page, these fetchers retrieve the content on demand.
User-triggered fetchers operate at lower volume with one-off requests that are user-initiated rather than automated. Most AI companies confirm these are not used for model training.
The Complete Directory: AI Crawlers by Company
The following sections provide a comprehensive reference of known AI crawlers, organized by operating company. Each entry includes the user agent token, purpose, and ready-to-use robots.txt syntax for your AI blocker implementation.
OpenAI Crawlers
OpenAI operates three primary crawlers, each serving distinct functions within the ChatGPT ecosystem.
User Agent | Purpose | Used for Training | Robots.txt Syntax
GPTBot | Model training data collection | Yes | User-agent: GPTBot, Disallow: /
OAI-SearchBot | Real-time search indexing for ChatGPT | No | User-agent: OAI-SearchBot, Disallow: /
ChatGPT-User | On-demand content fetching when users request URLs | No | User-agent: ChatGPT-User, Disallow: /
GPTBot is the primary training crawler. Blocking this prevents your content from being used in future model training. OpenAI publishes IP addresses for verification at https://openai.com/gptbot.json.
OAI-SearchBot handles real-time retrieval for ChatGPT's search features. OpenAI states this crawler does not collect training data. Blocking it may reduce your visibility in ChatGPT search results.
ChatGPT-User activates when users specifically request content. This fetcher makes one-off visits rather than systematic crawls. OpenAI confirms content accessed via this agent is not used for training.
Anthropic Crawlers
Anthropic operates multiple crawlers for Claude AI, though their documentation has been less comprehensive than OpenAI's.
User Agent | Purpose | Used for Training | Robots.txt Syntax
ClaudeBot | Primary training data collection | Yes | User-agent: ClaudeBot, Disallow: /
anthropic-ai | Bulk model training | Yes | User-agent: anthropic-ai, Disallow: /
Claude-Web | Web-focused crawling | Likely | User-agent: Claude-Web, Disallow: /
ClaudeBot is Anthropic's main web crawler for training Claude models. The full user agent string appears as: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com).
Anthropic's crawl-to-referral ratio is among the highest in the industry. Cloudflare data indicates ratios ranging from 38,000:1 to over 70,000:1 depending on the time period, meaning Anthropic crawls tens of thousands of pages for every visitor it refers back to publishers.
Google Crawlers
Google's AI crawling strategy deserves careful consideration. The company uses specific crawlers for AI training that are distinct from standard search indexing.
User Agent | Purpose | Used for Training | Robots.txt Syntax
Google-Extended | Gemini AI training data | Yes | User-agent: Google-Extended, Disallow: /
GoogleOther | Research and development | Unknown | User-agent: GoogleOther, Disallow: /
Google-CloudVertexBot | Cloud AI services | Unknown | User-agent: Google-CloudVertexBot, Disallow: /
Important consideration: Blocking Google-Extended may affect your visibility in Gemini's "Grounding with Google Search" feature, potentially reducing citations in AI-generated responses. However, AI Overviews in Google Search follow standard Googlebot rules. If your content is accessible to regular search, it remains accessible to AI Overviews.
Some webmasters have reported issues when blocking Google-Extended, claiming it affected their regular search indexing. While Google officially states it doesn't impact search rankings, proceed with caution and monitor your search performance if you implement this block.
Meta Crawlers
Meta operates several crawlers across its AI ecosystem, including those supporting Meta AI and its various platforms.
User Agent | Purpose | Used for Training | Robots.txt Syntax
Meta-ExternalAgent | AI model training | Yes | User-agent: Meta-ExternalAgent, Disallow: /
Meta-ExternalFetcher | Real-time content fetching | No | User-agent: Meta-ExternalFetcher, Disallow: /
FacebookBot | Speech recognition training | Yes | User-agent: FacebookBot, Disallow: /
Meta-ExternalAgent is Meta's primary training crawler. This bot systematically collects content for training AI models that power Meta AI across Facebook, Instagram, and WhatsApp.
Meta-ExternalFetcher functions similarly to ChatGPT-User, fetching content when users request specific URLs through Meta AI products.
Apple Crawlers
Apple's AI crawling supports Siri, Spotlight, Safari, and the company's broader AI ambitions with Apple Intelligence.
User Agent | Purpose | Used for Training | Robots.txt Syntax
Applebot | Siri, Spotlight, Safari features | Mixed | User-agent: Applebot, Disallow: /
Applebot-Extended | Generative AI training | Yes | User-agent: Applebot-Extended, Disallow: /
Apple's documentation states that data crawled by Applebot powers various features across Apple's ecosystem. Applebot-Extended specifically handles content collection for Apple's generative AI models, making it the primary target if you want to block training while maintaining Siri visibility.
Amazon Crawlers
Amazon operates multiple crawlers supporting Alexa, Rufus, and other AI-powered services.
User Agent | Purpose | Used for Training | Robots.txt Syntax
Amazonbot | General AI improvement, model training | Yes | User-agent: Amazonbot, Disallow: /
Amzn-SearchBot | Alexa and Rufus search experiences | Unclear | User-agent: Amzn-SearchBot, Disallow: /
Amazonbot crawls content to improve Amazon products and may use data for AI model training. Amazon provides IP addresses for verification at https://developer.amazon.com/amazonbot/ip-addresses/.
Amazon's documentation notes that Amazonbot respects robots meta tags, including noarchive (do not use the content for model training), noindex, and none.
Related Content:
- The Legal Landscape for Blocking AI Scrapers: What publishers need to know about their legal options for protecting content
- The Real Cost of Blocking AI: Traffic and revenue impact analysis to inform your blocking decisions
- How to Get AI Tools to Cite Your Website: Alternative strategy for publishers who want AI visibility without unlimited access
- A Guide to Invalid Traffic: Understanding bot traffic and its impact on your monetization
Additional AI Crawlers
Beyond the major tech companies, numerous other organizations operate AI crawlers that publishers should monitor when building their list of AI sites to block.
User Agent | Company | Purpose | Robots.txt Syntax
PerplexityBot | Perplexity | Search indexing | User-agent: PerplexityBot, Disallow: /
Perplexity-User | Perplexity | User-requested fetching | User-agent: Perplexity-User, Disallow: /
CCBot | Common Crawl | Open dataset collection | User-agent: CCBot, Disallow: /
Bytespider | ByteDance | AI training | User-agent: Bytespider, Disallow: /
cohere-ai | Cohere | LLM training | User-agent: cohere-ai, Disallow: /
Diffbot | Diffbot | AI data extraction | User-agent: Diffbot, Disallow: /
YouBot | You.com | AI search | User-agent: YouBot, Disallow: /
DuckAssistBot | DuckDuckGo | AI-assisted answers | User-agent: DuckAssistBot, Disallow: /
Omgilibot | Webz.io | Data collection for resale | User-agent: Omgilibot, Disallow: /
ImagesiftBot | The Hive | Image model training | User-agent: ImagesiftBot, Disallow: /
CCBot deserves special mention. Common Crawl is a nonprofit that creates open web archives used to train many AI models. Blocking CCBot may reduce your content's presence in models that rely on Common Crawl datasets, including some smaller AI companies that don't operate their own crawlers.
Ready-to-Use Robots.txt Configurations
The following configurations provide copy-and-paste solutions for common AI blocker scenarios.
Block All AI Training Crawlers
This configuration blocks crawlers that collect content for model training while potentially allowing search and citation crawlers.
# Block AI Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
Block All AI Crawlers Comprehensively
For publishers who want maximum protection, this expanded configuration covers the full known list of AI sites to block.
# Comprehensive AI Crawler Block
User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: DuckAssistBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GoogleOther
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: YouBot
Disallow: /
Selective Blocking: Training Only
This balanced approach blocks training crawlers while allowing search and citation crawlers that may drive referral traffic.
# Block Training, Allow Search/Citation
# Training Crawlers - BLOCKED
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
# Search/Citation Crawlers - ALLOWED
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: DuckAssistBot
Allow: /
Beyond Robots.txt: Stronger Protection Methods
Robots.txt provides a starting point, but it relies on crawlers voluntarily respecting your directives. Some crawlers don't respect robots.txt, and bad actors can spoof user agent strings to bypass restrictions. Publishers seeking stronger protection should consider additional measures. Understanding the legal landscape around blocking AI scrapers helps inform which technical measures you can confidently deploy.
Next Steps:
- Publisher Ad Tech Stack Guide: Understand how each component contributes to revenue and identify optimization opportunities
- Header Bidding Guide: Maximize competition for your inventory with clean traffic data
IP Verification and Firewall Rules
The most reliable method for verifying legitimate crawlers involves checking request IPs against officially published ranges. Major AI companies provide JSON files containing their crawler IP addresses.
Published IP sources include:
- OpenAI: https://openai.com/gptbot.json, https://openai.com/searchbot.json, https://openai.com/chatgpt-user.json
- Amazon: https://developer.amazon.com/amazonbot/ip-addresses/
Firewall rules can allowlist verified IPs while blocking requests from unverified sources claiming to be AI crawlers. This approach prevents spoofed user agents from bypassing your restrictions.
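As a rough illustration, the check can be scripted. The Python sketch below downloads OpenAI's published GPTBot ranges and tests whether a given request IP falls inside them; it assumes the JSON uses a prefixes list with ipv4Prefix/ipv6Prefix keys, so confirm the structure of the file you actually download, and treat the example IP as a placeholder.
import ipaddress
import json
import urllib.request

# Published GPTBot IP list (see the sources above). The key names below are
# an assumption about the file's layout; adjust them if the JSON differs.
GPTBOT_RANGES_URL = "https://openai.com/gptbot.json"

def load_prefixes(url):
    """Download the published JSON and return the CIDR ranges it lists."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_verified_crawler(client_ip, networks):
    """True if the client IP falls inside any published range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in networks)

if __name__ == "__main__":
    nets = load_prefixes(GPTBOT_RANGES_URL)
    # Placeholder IP pulled from a hypothetical "GPTBot" log entry.
    print(is_verified_crawler("203.0.113.10", nets))
Requests that claim a crawler's user agent but fail this kind of check are good candidates for a firewall block.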
Server-Level Blocking with .htaccess
For Apache servers, .htaccess rules provide another layer of protection that operates independently of robots.txt compliance.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|anthropic-ai|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]
</IfModule>
This returns a 403 Forbidden response to matching user agents, regardless of robots.txt settings.
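To confirm the rule is actually live, you can send a test request that presents a blocked user agent and check for the 403. A minimal Python sketch, with the URL as a placeholder for a page on your own site and the user agent string chosen only so it matches the rewrite pattern above:
import urllib.error
import urllib.request

TEST_URL = "https://www.example.com/"  # replace with a page on your site
BLOCKED_UA = "GPTBot"  # any string that matches your RewriteCond pattern

req = urllib.request.Request(TEST_URL, headers={"User-Agent": BLOCKED_UA})
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(f"Not blocked: got HTTP {resp.status}")
except urllib.error.HTTPError as exc:
    if exc.code == 403:
        print("Block is working: server returned 403 Forbidden")
    else:
        print(f"Unexpected response: HTTP {exc.code}")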
Meta Tags for Granular Control
Amazon and some other crawlers respect HTML meta tags that provide page-level control.
<meta name="robots" content="noarchive">
The noarchive directive tells crawlers not to use the page for model training while potentially allowing other indexing activities.
The Trade-offs Publishers Must Consider
Blocking AI crawlers isn't a straightforward decision. Publishers must weigh multiple factors when developing their AI blocker strategy. Our analysis of the real cost of blocking AI including traffic and revenue impact provides data to inform this decision.
Visibility in AI-Powered Discovery
AI platforms are increasingly becoming discovery channels. Users asking ChatGPT, Perplexity, or Google's AI features about topics may receive citations to relevant sources. Blocking search crawlers could reduce your visibility in these emerging discovery platforms. Some publishers are exploring how to get AI tools to cite their website as an alternative to blocking.
Server Load and Bandwidth Costs
AI crawlers can generate significant server load. One infrastructure project reported that blocking AI crawlers reduced their bandwidth consumption from 800GB to 200GB daily, saving approximately $1,500 per month. High-traffic publishers may see meaningful cost reductions from selective blocking.
Content Protection vs. Traffic Trade-offs
The core tension remains: training crawlers consume your content to build models that may reduce users' need to visit your site. Search crawlers index content for AI-powered search that may or may not send traffic back. Publishers must decide which trade-offs align with their business model.
Verifying Crawlers Are Respecting Your Blocks
Setting up robots.txt is only the beginning. You need visibility into whether crawlers are actually respecting your directives.
Checking Server Logs
Your server logs reveal exactly which crawlers are accessing your site and what they're requesting. Look for entries containing user agent strings matching the crawlers you've blocked.
For Apache servers, access logs typically live in /var/log/apache2/access.log. Nginx logs are usually at /var/log/nginx/access.log. Filter for AI crawler patterns using grep or your log analysis tool of choice.
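If you prefer a script to a grep one-liner, the Python sketch below counts hits per AI crawler token in an access log. It assumes a standard combined log format where the user agent appears somewhere on each line; the token list and log path are placeholders to adapt to your own blocklist and server.
import re
from collections import Counter
from pathlib import Path

# Tokens to look for in the user-agent portion of each log line.
# Extend this list to mirror the crawlers in your robots.txt blocks.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "Meta-ExternalAgent", "Bytespider", "CCBot",
    "PerplexityBot", "Amazonbot", "Applebot-Extended",
]
pattern = re.compile("|".join(re.escape(t) for t in AI_CRAWLER_TOKENS), re.IGNORECASE)

log_path = Path("/var/log/apache2/access.log")  # adjust for nginx or custom paths

hits = Counter()
with log_path.open(errors="replace") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            hits[match.group(0)] += 1

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")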
If you see requests from blocked crawlers still hitting your content pages, they may not be respecting robots.txt. This is where server-level blocking or firewall rules become necessary.
Using Analytics and Monitoring Tools
Several platforms now offer AI crawler monitoring. Cloudflare Radar tracks AI bot traffic patterns globally and provides insights into which crawlers are most active. For site-specific monitoring, analytics platforms increasingly differentiate bot traffic from human visitors.
Watch for unexpected traffic patterns that might indicate crawler activity. AI crawlers often exhibit bursty behavior, making many requests in short periods before going quiet. This pattern differs from the steady traffic you'd expect from human visitors.
Testing Your Robots.txt
Google Search Console's robots.txt report (which replaced the old robots.txt tester) shows whether Google can fetch and parse your file and flags any syntax warnings or errors it encounters. While it doesn't test non-Google crawlers, it confirms your file is valid and being read.
For a manual test, access your robots.txt file directly at yoursite.com/robots.txt after uploading changes. Verify all user agents and directives appear correctly.
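For blocked and allowed user agents alike, Python's standard library can show how a rules-compliant parser reads your live file. A short sketch, with the domain as a placeholder; note it only tells you how your directives parse, not whether any given crawler actually honors them:
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # replace with your own domain

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()

# Spot-check a few agents against the site root; add any others you block.
for agent in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "Googlebot"]:
    allowed = parser.can_fetch(agent, f"{SITE}/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'} at /")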
Maintaining Your Crawler Blocklist
The AI crawler landscape evolves rapidly. New crawlers emerge regularly, existing crawlers update their user agents, and companies introduce new bots without notice. Maintaining an effective AI blocker strategy requires ongoing attention.
Here are key monitoring recommendations for keeping your list of AI sites to block current:
- Check server logs regularly. Look for user agent strings containing "bot," "crawler," "spider," or company names like "GPT," "Claude," or "Perplexity."
- Review crawl analytics. Tools like Cloudflare Radar provide visibility into AI crawler traffic patterns and can help identify new crawlers hitting your properties.
- Track industry resources. The ai.robots.txt project on GitHub maintains a community-updated list of known AI crawlers and user agents; a sketch for comparing it against your own robots.txt follows this list.
- Test your implementations. Verify that your robots.txt and server-level blocks are working by checking crawler access in your analytics.
- Update quarterly at minimum. New crawlers appear frequently. Schedule regular reviews of your blocklist to catch additions.
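One way to operationalize the quarterly review is to diff the user agents in your robots.txt against the community list. The Python sketch below assumes the ai.robots.txt project publishes a robots.txt at the repository root (check the project's README for the exact raw URL) and that your own file is available locally:
import re
import urllib.request

# Assumed raw path to the community robots.txt; verify it in the project README.
COMMUNITY_URL = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"
LOCAL_ROBOTS = "robots.txt"  # path to your own robots.txt file

AGENT_RE = re.compile(r"^\s*user-agent:\s*(\S+)", re.IGNORECASE)

def extract_agents(text):
    """Collect the User-agent tokens declared in a robots.txt body."""
    return {m.group(1) for line in text.splitlines() if (m := AGENT_RE.match(line))}

with urllib.request.urlopen(COMMUNITY_URL, timeout=10) as resp:
    community = extract_agents(resp.read().decode("utf-8", errors="replace"))

with open(LOCAL_ROBOTS, encoding="utf-8") as f:
    local = extract_agents(f.read())

print("Crawlers in the community list but not in your robots.txt:")
for agent in sorted(community - local - {"*"}):
    print(f"  {agent}")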
Emerging Crawlers to Watch
The AI crawler ecosystem continues to expand. New crawlers and fetchers are emerging from companies like xAI (Grok), Mistral, DeepSeek, and others, and may use user agent strings such as:
- GrokBot/xAI-Grok: xAI's crawler for Grok AI
- MistralAI-User: Mistral's content fetcher
- DeepseekBot: DeepSeek's AI crawler
Some AI browser agents, like OpenAI's Operator and similar products, don't use distinctive user agents. They appear as standard Chrome traffic, making them impossible to block through traditional methods. This represents an emerging challenge for publishers seeking to control AI access to their content.
This directory will be updated regularly as new crawlers are identified and existing ones evolve. Bookmark this resource and check back for additions to the comprehensive list of AI sites to block.
See It In Action:
- Traffic Shaping Revolution: How ML-powered traffic optimization boosted publisher revenue by 12%
How Publishers Can Protect Revenue While Managing AI Crawlers
Protecting your content from unchecked AI scraping is only half the equation. The traffic that does reach your site represents your monetization opportunity. With training crawlers accounting for nearly 80% of AI bot traffic and referral ratios heavily skewed against publishers, maximizing revenue from every pageview has never been more critical.
Advanced yield optimization helps publishers capture maximum value from the traffic they retain. Real-time analytics showing exactly how your content drives revenue allows smarter decisions about content strategy and crawler access policies. When you understand which pages and traffic sources generate the highest RPMs, you can make informed choices about which crawlers to allow and which to block. Understanding your complete ad tech stack and how each component contributes to revenue helps publishers identify optimization opportunities across their monetization infrastructure.
For publishers running header bidding to maximize competition for their inventory, ensuring that real human traffic, not bots, drives your auction dynamics becomes even more important. Similarly, publishers using ad exchanges to access programmatic demand need clean traffic data to maintain advertiser confidence and premium CPMs.
For publishers managing AI crawler blocking alongside revenue optimization, having expert guidance on balancing traffic protection with monetization makes a significant difference. Yield operations professionals who monitor performance around the clock can catch issues before they impact your bottom line, ensuring that the traffic you do receive generates maximum ad revenue.
Ready to amplify your ad revenue while you focus on protecting your content? Learn how Playwire can help you get more from the traffic you're keeping.
Frequently Asked Questions About AI Crawlers
What is the difference between training crawlers and search crawlers?
Training crawlers like GPTBot and ClaudeBot collect content to build datasets for large language model development. This content becomes part of the AI's knowledge base. Search crawlers like OAI-SearchBot and PerplexityBot index content for AI-powered search experiences and may send referral traffic back to publishers through citations.
Will blocking Google-Extended affect my search rankings?
Google officially states that blocking Google-Extended does not impact search rankings or inclusion in AI Overviews. However, some webmasters have reported concerns, so monitor your search performance after implementing blocks. AI Overviews in Google Search follow standard Googlebot rules, not Google-Extended.
How often should I update my AI crawler blocklist?
New AI crawlers emerge regularly, so review and update your blocklist quarterly at minimum. Track resources like the ai.robots.txt project on GitHub for community-maintained lists. Check server logs monthly to identify new crawlers hitting your site that aren't in your current configuration.
Can AI crawlers ignore robots.txt directives?
Yes, robots.txt is advisory rather than enforceable. Well-behaved crawlers from major companies generally respect robots.txt directives, but some crawlers ignore them. For stronger protection, implement server-level blocking via .htaccess or firewall rules, and verify legitimate crawlers using published IP address ranges.
Should I block all AI crawlers or just training crawlers?
This depends on your business priorities. Blocking training crawlers protects your content from being incorporated into AI models. Blocking search crawlers may reduce your visibility in AI-powered discovery platforms like ChatGPT search or Perplexity. Many publishers opt for selective blocking that targets training crawlers while allowing search and citation crawlers.


