
The Complete List of AI Crawlers and How to Block Each One

December 8, 2025


Key Points

  • Training crawlers now account for nearly 80% of all AI crawler traffic to websites, consuming publisher content while sending minimal referral traffic back.
  • OpenAI, Anthropic, Google, Meta, Apple, and Amazon each operate multiple crawlers with distinct purposes, from model training to real-time search functionality.
  • Robots.txt blocking is the first line of defense, but verification through IP allowlisting provides stronger protection against spoofed user agents.
  • Publishers must weigh the trade-offs carefully: blocking training crawlers protects content while blocking search crawlers may reduce visibility in AI-powered discovery platforms.
  • This directory includes ready-to-use robots.txt snippets for every major AI crawler, organized by company and purpose.

Why Publishers Need an AI Blocker Strategy

The relationship between publishers and crawlers has fundamentally changed. Traditional search engines operated on a symbiotic model: they crawled your content, indexed it, and sent visitors your way when users searched for relevant information. AI crawlers have flipped this arrangement on its head.

Cloudflare's data reveals the stark imbalance in this new reality. For every referral Anthropic sends back to a website, its crawlers have already visited approximately 38,000 pages. OpenAI's ratio sits around 400:1. These platforms consume vast amounts of publisher content to train models and power AI-generated responses, often without users ever clicking through to the source.

The impact on publisher traffic is real and measurable. Reports from Digital Content Next indicate that AI overviews and chat-based responses are contributing to traffic declines ranging from 9% to 25% for news and content sites.

For publishers who depend on traffic-based ad revenue, understanding which crawlers are hitting your site and deciding which to block has become a critical business decision. Our complete publisher's guide to AI crawlers covers the strategic framework for deciding whether to block, allow, or optimize your approach to these bots.


Understanding AI Crawler Categories

Before diving into the comprehensive list of AI sites to block, you need to understand what these crawlers actually do. AI crawlers fall into three distinct categories, each with different implications for your content and traffic.

Training Crawlers

Training crawlers collect web content to build datasets for large language model development. This is the most aggressive category, accounting for roughly 80% of all AI crawler traffic according to Cloudflare's analysis. Once your content enters a training dataset, it becomes part of the model's knowledge base, potentially reducing users' need to visit your site for answers.

These crawlers operate at high volume with systematic crawling patterns. The content they collect feeds model improvement, and they return little to no referral traffic to publishers. Understanding how AI crawling affects your ad revenue through measurable traffic and monetization impacts helps publishers quantify what's at stake.

Search and Citation Crawlers

Search crawlers index content for AI-powered search experiences and citation purposes. When users ask questions in ChatGPT or Perplexity, these crawlers help surface relevant sources. Unlike training crawlers, search crawlers may actually send some traffic back to publishers through citations.

These operate at moderate volume with retrieval-focused behavior. They may include attribution and links, offering some referral traffic potential for publishers who remain accessible.

User-Triggered Fetchers

These crawlers activate when users specifically request content through AI assistants. When someone pastes a URL into ChatGPT or asks Perplexity to analyze a specific page, these fetchers retrieve the content on demand.

User-triggered fetchers operate at lower volume with one-off requests that are user-initiated rather than automated. Most AI companies confirm these are not used for model training.


The Complete Directory: AI Crawlers by Company

The following sections provide a comprehensive reference of known AI crawlers, organized by operating company. Each entry includes the user agent token, purpose, and ready-to-use robots.txt syntax for your AI blocker implementation.

OpenAI Crawlers

OpenAI operates three primary crawlers, each serving distinct functions within the ChatGPT ecosystem.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| GPTBot | Model training data collection | Yes | `User-agent: GPTBot Disallow: /` |
| OAI-SearchBot | Real-time search indexing for ChatGPT | No | `User-agent: OAI-SearchBot Disallow: /` |
| ChatGPT-User | On-demand content fetching when users request URLs | No | `User-agent: ChatGPT-User Disallow: /` |

GPTBot is the primary training crawler. Blocking this prevents your content from being used in future model training. OpenAI publishes IP addresses for verification at https://openai.com/gptbot.json.

OAI-SearchBot handles real-time retrieval for ChatGPT's search features. OpenAI states this crawler does not collect training data. Blocking it may reduce your visibility in ChatGPT search results.

ChatGPT-User activates when users specifically request content. This fetcher makes one-off visits rather than systematic crawls. OpenAI confirms content accessed via this agent is not used for training.
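
OpenAI's published IP list also makes it possible to verify GPTBot requests programmatically rather than trusting the user agent string. The sketch below pulls the JSON and checks whether a request IP falls inside one of the listed ranges; it assumes the file exposes CIDR ranges under a prefixes key with ipv4Prefix/ipv6Prefix fields, so confirm the live structure before relying on it.

```python
# A rough sketch of crawler IP verification, assuming the published JSON lists
# CIDR ranges under a "prefixes" key with "ipv4Prefix"/"ipv6Prefix" fields.
import ipaddress
import json
import urllib.request

GPTBOT_RANGES_URL = "https://openai.com/gptbot.json"

def load_networks(url: str = GPTBOT_RANGES_URL):
    """Download the published IP list and return parsed network objects."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    networks = []
    for entry in data.get("prefixes", []):  # assumed field names
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def is_verified_gptbot(remote_ip: str, networks) -> bool:
    """True only if the request IP falls inside one of the published ranges."""
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in networks)

if __name__ == "__main__":
    nets = load_networks()
    # 203.0.113.7 is a documentation-range placeholder; expect False.
    print(is_verified_gptbot("203.0.113.7", nets))
```

The same pattern works for the other published IP lists referenced later in this guide.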

Anthropic Crawlers

Anthropic operates multiple crawlers for Claude AI, though their documentation has been less comprehensive than OpenAI's.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| ClaudeBot | Primary training data collection | Yes | `User-agent: ClaudeBot Disallow: /` |
| anthropic-ai | Bulk model training | Yes | `User-agent: anthropic-ai Disallow: /` |
| Claude-Web | Web-focused crawling | Likely | `User-agent: Claude-Web Disallow: /` |

ClaudeBot is Anthropic's main web crawler for training Claude models. The full user agent string appears as: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com).

Anthropic's crawl-to-refer ratio is among the highest in the industry. Cloudflare data indicates ratios ranging from 38,000:1 to over 70,000:1 depending on the time period. That means Anthropic crawls significantly more content than it refers back to publishers.

Google Crawlers

Google's AI crawling strategy deserves careful consideration. The company uses specific crawlers for AI training that are distinct from standard search indexing.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| Google-Extended | Gemini AI training data | Yes | `User-agent: Google-Extended Disallow: /` |
| GoogleOther | Research and development | Unknown | `User-agent: GoogleOther Disallow: /` |
| Google-CloudVertexBot | Cloud AI services | Unknown | `User-agent: Google-CloudVertexBot Disallow: /` |

Important consideration: Google-Extended is a robots.txt control token rather than a separate crawler; Googlebot does the fetching, and the token governs whether that content may be used for Gemini training and grounding. Blocking Google-Extended may affect your visibility in Gemini's "Grounding with Google Search" feature, potentially reducing citations in AI-generated responses. However, AI Overviews in Google Search follow standard Googlebot rules: if your content is accessible to regular search, it remains accessible to AI Overviews.

Some webmasters have reported issues when blocking Google-Extended, claiming it affected their regular search indexing. While Google officially states it doesn't impact search rankings, proceed with caution and monitor your search performance if you implement this block.

Meta Crawlers

Meta operates several crawlers across its AI ecosystem, including those supporting Meta AI and its various platforms.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| Meta-ExternalAgent | AI model training | Yes | `User-agent: Meta-ExternalAgent Disallow: /` |
| Meta-ExternalFetcher | Real-time content fetching | No | `User-agent: Meta-ExternalFetcher Disallow: /` |
| FacebookBot | Speech recognition training | Yes | `User-agent: FacebookBot Disallow: /` |

Meta-ExternalAgent is Meta's primary training crawler. This bot systematically collects content for training AI models that power Meta AI across Facebook, Instagram, and WhatsApp.

Meta-ExternalFetcher functions similarly to ChatGPT-User, fetching content when users request specific URLs through Meta AI products.

Apple Crawlers

Apple's AI crawling supports Siri, Spotlight, Safari, and the company's broader AI ambitions with Apple Intelligence.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| Applebot | Siri, Spotlight, Safari features | Mixed | `User-agent: Applebot Disallow: /` |
| Applebot-Extended | Generative AI training | Yes | `User-agent: Applebot-Extended Disallow: /` |

Apple's documentation states that data crawled by Applebot powers various features across Apple's ecosystem. Applebot-Extended specifically handles content collection for Apple's generative AI models, making it the primary target if you want to block training while maintaining Siri visibility.

Amazon Crawlers

Amazon operates multiple crawlers supporting Alexa, Rufus, and other AI-powered services.

| User Agent | Purpose | Used for Training | Robots.txt Syntax |
| --- | --- | --- | --- |
| Amazonbot | General AI improvement, model training | Yes | `User-agent: Amazonbot Disallow: /` |
| Amzn-SearchBot | Alexa and Rufus search experiences | Unclear | `User-agent: Amzn-SearchBot Disallow: /` |

Amazonbot crawls content to improve Amazon products and may use data for AI model training. Amazon provides IP addresses for verification at https://developer.amazon.com/amazonbot/ip-addresses/.

Amazon's documentation notes they respect robots meta tags including noarchive (do not use for model training), noindex, and none.


Additional AI Crawlers

Beyond the major tech companies, numerous other organizations operate AI crawlers that publishers should monitor when building their list of AI sites to block.

| User Agent | Company | Purpose | Robots.txt Syntax |
| --- | --- | --- | --- |
| PerplexityBot | Perplexity | Search indexing | `User-agent: PerplexityBot Disallow: /` |
| Perplexity-User | Perplexity | User-requested fetching | `User-agent: Perplexity-User Disallow: /` |
| CCBot | Common Crawl | Open dataset collection | `User-agent: CCBot Disallow: /` |
| Bytespider | ByteDance | AI training | `User-agent: Bytespider Disallow: /` |
| cohere-ai | Cohere | LLM training | `User-agent: cohere-ai Disallow: /` |
| Diffbot | Diffbot | AI data extraction | `User-agent: Diffbot Disallow: /` |
| YouBot | You.com | AI search | `User-agent: YouBot Disallow: /` |
| DuckAssistBot | DuckDuckGo | AI-assisted answers | `User-agent: DuckAssistBot Disallow: /` |
| Omgilibot | Webz.io | Data collection for resale | `User-agent: Omgilibot Disallow: /` |
| ImagesiftBot | The Hive | Image model training | `User-agent: ImagesiftBot Disallow: /` |

CCBot deserves special mention. Common Crawl is a nonprofit that creates open web archives used to train many AI models. Blocking CCBot may reduce your content's presence in models that rely on Common Crawl datasets, including some smaller AI companies that don't operate their own crawlers.

Ready-to-Use Robots.txt Configurations

The following configurations provide copy-and-paste solutions for common AI blocker scenarios.

Block All AI Training Crawlers

This configuration blocks the crawlers that collect content for model training while leaving search and citation crawlers untouched; robots.txt permits any agent you don't explicitly list.

```robots.txt
# Block AI Training Crawlers

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: ImagesiftBot
Disallow: /
```

Block All AI Crawlers Comprehensively

For publishers who want maximum protection, this expanded configuration covers the full known list of AI sites to block.

```robots.txt
# Comprehensive AI Crawler Block

User-agent: Amazonbot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Timpibot
Disallow: /

User-agent: YouBot
Disallow: /
```

Selective Blocking: Training Only

This balanced approach blocks training crawlers while allowing search and citation crawlers that may drive referral traffic.

```robots.txt
# Block Training, Allow Search/Citation

# Training Crawlers - BLOCKED

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Search/Citation Crawlers - ALLOWED

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /
```

Beyond Robots.txt: Stronger Protection Methods

Robots.txt provides a starting point, but it relies on crawlers voluntarily respecting your directives. Some crawlers don't respect robots.txt, and bad actors can spoof user agent strings to bypass restrictions. Publishers seeking stronger protection should consider additional measures. Understanding the legal landscape around blocking AI scrapers helps inform which technical measures you can confidently deploy.


IP Verification and Firewall Rules

The most reliable method for verifying legitimate crawlers involves checking request IPs against officially published ranges. Major AI companies provide JSON files containing their crawler IP addresses.

Published IP sources include:

  • OpenAI: https://openai.com/gptbot.json, https://openai.com/searchbot.json, https://openai.com/chatgpt-user.json
  • Amazon: https://developer.amazon.com/amazonbot/ip-addresses/

Firewall rules can allowlist verified IPs while blocking requests from unverified sources claiming to be AI crawlers. This approach prevents spoofed user agents from bypassing your restrictions.
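
The logic behind such a rule can be sketched in a few lines, independent of any particular firewall or CDN product. The ranges below are placeholders; in practice you would load them from the published sources above. A request counts as a verified crawler only when its user agent claim is backed by a matching IP range, and is flagged as likely spoofing otherwise.

```python
# Illustrative allowlist check: a crawler claim counts only if the request IP
# falls inside that crawler's published ranges. Ranges here are placeholders.
import ipaddress

verified_networks = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],       # placeholder range
    "Amazonbot": [ipaddress.ip_network("198.51.100.0/24")],  # placeholder range
}

def classify_request(user_agent: str, remote_ip: str) -> str:
    """Return 'verified-crawler', 'spoofed-crawler', or 'other'."""
    ip = ipaddress.ip_address(remote_ip)
    for token, networks in verified_networks.items():
        if token.lower() in user_agent.lower():
            if any(ip in net for net in networks):
                return "verified-crawler"   # UA claim backed by published IPs
            return "spoofed-crawler"        # claims to be a crawler, wrong IP
    return "other"

# A request claiming to be GPTBot from an unlisted IP gets flagged.
print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "203.0.113.9"))
```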

Server-Level Blocking with .htaccess

For Apache servers, .htaccess rules provide another layer of protection that operates independently of robots.txt compliance.

```apache
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|anthropic-ai|Bytespider|CCBot) [NC]
  RewriteRule .* - [F,L]
</IfModule>
```

This returns a 403 Forbidden response to matching user agents, regardless of robots.txt settings.
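
A quick way to confirm the rule is doing its job is to request one of your own pages with a blocked user agent and check for the 403. The short script below does exactly that; the URL and agent list are placeholders to swap for your own.

```python
# Send requests with blocked crawler user agents and report the status codes.
# Replace TEST_URL with a page on your own site before running.
import urllib.error
import urllib.request

TEST_URL = "https://www.example.com/"                 # placeholder URL
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "Bytespider"]

for agent in BLOCKED_AGENTS:
    req = urllib.request.Request(TEST_URL, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{agent}: HTTP {resp.status} (block NOT applied)")
    except urllib.error.HTTPError as err:
        note = "blocked as expected" if err.code == 403 else "unexpected error"
        print(f"{agent}: HTTP {err.code} ({note})")
```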

Meta Tags for Granular Control

Amazon and some other crawlers respect HTML meta tags that provide page-level control.

```html
<meta name="robots" content="noarchive">
```

The noarchive directive tells crawlers that support this interpretation, such as Amazonbot, not to use the page for model training while still allowing other indexing activities; crawlers that don't may simply treat it as a request not to serve cached copies.

The Trade-offs Publishers Must Consider

Blocking AI crawlers isn't a straightforward decision. Publishers must weigh multiple factors when developing their AI blocker strategy. Our analysis of the real cost of blocking AI including traffic and revenue impact provides data to inform this decision.

Visibility in AI-Powered Discovery

AI platforms are increasingly becoming discovery channels. Users asking ChatGPT, Perplexity, or Google's AI features about topics may receive citations to relevant sources. Blocking search crawlers could reduce your visibility in these emerging discovery platforms. Some publishers are exploring how to get AI tools to cite their website as an alternative to blocking.

Server Load and Bandwidth Costs

AI crawlers can generate significant server load. One infrastructure project reported that blocking AI crawlers cut its daily bandwidth consumption from 800GB to 200GB, saving approximately $1,500 per month. High-traffic publishers may see meaningful cost reductions from selective blocking.

Content Protection vs. Traffic Trade-offs

The core tension remains: training crawlers consume your content to build models that may reduce users' need to visit your site. Search crawlers index content for AI-powered search that may or may not send traffic back. Publishers must decide which trade-offs align with their business model.

Verifying Crawlers Are Respecting Your Blocks

Setting up robots.txt is only the beginning. You need visibility into whether crawlers are actually respecting your directives.

Checking Server Logs

Your server logs reveal exactly which crawlers are accessing your site and what they're requesting. Look for entries containing user agent strings matching the crawlers you've blocked.

For Apache servers, access logs typically live in /var/log/apache2/access.log. Nginx logs are usually at /var/log/nginx/access.log. Filter for AI crawler patterns using grep or your log analysis tool of choice.

If you see requests from blocked crawlers still hitting your content pages, they may not be respecting robots.txt. This is where server-level blocking or firewall rules become necessary.
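
As a starting point, a short script like the one below tallies hits per AI crawler from an access log. It assumes the combined log format, where the user agent is the last quoted field; adjust the parsing and the log path to match your own servers.

```python
# Count requests per AI crawler in an access log (combined log format assumed).
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # or /var/log/apache2/access.log
CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Google-Extended", "Meta-ExternalAgent", "Applebot-Extended", "Amazonbot",
    "PerplexityBot", "CCBot", "Bytespider",
]

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # In combined format the user agent is the last double-quoted field.
        user_agent = line.rsplit('"', 2)[-2] if '"' in line else line
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent.lower():
                hits[token] += 1

for token, count in hits.most_common():
    print(f"{token}: {count} requests")
```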

Using Analytics and Monitoring Tools

Several platforms now offer AI crawler monitoring. Cloudflare Radar tracks AI bot traffic patterns globally and provides insights into which crawlers are most active. For site-specific monitoring, analytics platforms increasingly differentiate bot traffic from human visitors.

Watch for unexpected traffic patterns that might indicate crawler activity. AI crawlers often exhibit bursty behavior, making many requests in short periods before going quiet. This pattern differs from the steady traffic you'd expect from human visitors.
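
To put a number on that burstiness, a minimal sketch like the following groups one crawler's requests by minute and flags spikes. The crawler token, log path, and threshold are assumptions to tune for your own traffic.

```python
# Flag minutes where a single crawler made an unusually high number of requests.
# Assumes combined log format with timestamps like [08/Dec/2025:14:32:07 +0000].
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # placeholder path
TOKEN = "ClaudeBot"                      # crawler to inspect
THRESHOLD = 100                          # requests per minute treated as a burst

per_minute = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if TOKEN.lower() not in line.lower():
            continue
        start = line.find("[") + 1
        minute = line[start:start + 17]   # keep "08/Dec/2025:14:32"
        per_minute[minute] += 1

for minute, count in sorted(per_minute.items()):
    if count >= THRESHOLD:
        print(f"{minute}: {count} requests from {TOKEN} (burst)")
```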

Testing Your Robots.txt

Google Search Console's robots.txt report confirms that your file can be fetched and parsed and shows how Googlebot interprets your rules. While it doesn't test non-Google crawlers, it verifies that your syntax is valid.

For a manual test, access your robots.txt file directly at yoursite.com/robots.txt after uploading changes. Verify all user agents and directives appear correctly.
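
You can also test non-Google crawlers programmatically. Python's built-in robotparser applies standard robots.txt semantics, so a small script can show whether a given user agent token would be allowed or blocked under your current rules; swap in your own domain before running.

```python
# Check how a standards-compliant parser reads your robots.txt for AI crawlers.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://www.example.com/robots.txt"   # placeholder domain
AGENTS = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

for agent in AGENTS:
    allowed = parser.can_fetch(agent, "https://www.example.com/any-page/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that this only tells you what a compliant crawler should do; pair it with log checks to see what crawlers actually do.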


Maintaining Your Crawler Blocklist

The AI crawler landscape evolves rapidly. New crawlers emerge regularly, existing crawlers update their user agents, and companies introduce new bots without notice. Maintaining an effective AI blocker strategy requires ongoing attention.

Here are key monitoring recommendations for keeping your list of AI sites to block current:

  • Check server logs regularly. Look for user agent strings containing "bot," "crawler," "spider," or company names like "GPT," "Claude," or "Perplexity."
  • Review crawl analytics. Tools like Cloudflare Radar provide visibility into AI crawler traffic patterns and can help identify new crawlers hitting your properties.
  • Track industry resources. The ai.robots.txt project on GitHub maintains a community-updated list of known AI crawlers and user agents; see the comparison sketch after this list.
  • Test your implementations. Verify that your robots.txt and server-level blocks are working by checking crawler access in your analytics.
  • Update quarterly at minimum. New crawlers appear frequently. Schedule regular reviews of your blocklist to catch additions.
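
One way to keep that review lightweight is to diff the user agents in your robots.txt against the community-maintained list, as in the sketch below. The raw-file URL for the ai.robots.txt project is an assumption; check the repository for the current path before using it.

```python
# Compare the User-agent tokens in your robots.txt against the community list.
import re
import urllib.request

COMMUNITY_LIST = (
    "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"
)  # assumed location of the project's consolidated robots.txt
YOUR_ROBOTS = "https://www.example.com/robots.txt"   # placeholder domain

def user_agents(url: str) -> set:
    """Extract the User-agent tokens declared in a robots.txt file."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return {
        match.strip()
        for match in re.findall(r"(?im)^user-agent:\s*(.+)$", text)
        if match.strip() != "*"
    }

missing = user_agents(COMMUNITY_LIST) - user_agents(YOUR_ROBOTS)
print("Crawlers in the community list but missing from your robots.txt:")
for agent in sorted(missing):
    print(f"  {agent}")
```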

Emerging Crawlers to Watch

The AI crawler ecosystem continues expanding. New crawlers and user-triggered fetchers are emerging from companies such as xAI (Grok), Mistral, and DeepSeek, and they may use user agent strings like:

  • GrokBot/xAI-Grok: xAI's crawler for Grok AI
  • MistralAI-User: Mistral's content fetcher
  • DeepseekBot: DeepSeek's AI crawler

Some AI browser agents, like OpenAI's Operator and similar products, don't use distinctive user agents. They appear as standard Chrome traffic, making them difficult or impossible to block with user-agent-based methods. This represents an emerging challenge for publishers seeking to control AI access to their content.

This directory will be updated regularly as new crawlers are identified and existing ones evolve. Bookmark this resource and check back for additions to the comprehensive list of AI sites to block.


How Publishers Can Protect Revenue While Managing AI Crawlers

Protecting your content from unchecked AI scraping is only half the equation. The traffic that does reach your site represents your monetization opportunity. With training crawlers making up nearly 80% of AI crawler traffic and referral ratios heavily skewed against publishers, maximizing revenue from every pageview has never been more critical.

Advanced yield optimization helps publishers capture maximum value from the traffic they retain. Real-time analytics showing exactly how your content drives revenue allows smarter decisions about content strategy and crawler access policies. When you understand which pages and traffic sources generate the highest RPMs, you can make informed choices about which crawlers to allow and which to block. Understanding your complete ad tech stack and how each component contributes to revenue helps publishers identify optimization opportunities across their monetization infrastructure.

For publishers running header bidding to maximize competition for their inventory, ensuring that real human traffic, not bots, drives your auction dynamics becomes even more important. Similarly, publishers using ad exchanges to access programmatic demand need clean traffic data to maintain advertiser confidence and premium CPMs.

For publishers managing AI crawler blocking alongside revenue optimization, having expert guidance on balancing traffic protection with monetization makes a significant difference. Yield operations professionals who monitor performance around the clock can catch issues before they impact your bottom line, ensuring that the traffic you do receive generates maximum ad revenue. 

Ready to amplify your ad revenue while you focus on protecting your content? Learn how Playwire can help you get more from the traffic you're keeping.


Frequently Asked Questions About AI Crawlers

What is the difference between training crawlers and search crawlers?

Training crawlers like GPTBot and ClaudeBot collect content to build datasets for large language model development. This content becomes part of the AI's knowledge base. Search crawlers like OAI-SearchBot and PerplexityBot index content for AI-powered search experiences and may send referral traffic back to publishers through citations.

Will blocking Google-Extended affect my search rankings?

Google officially states that blocking Google-Extended does not impact search rankings or inclusion in AI Overviews. However, some webmasters have reported concerns, so monitor your search performance after implementing blocks. AI Overviews in Google Search follow standard Googlebot rules, not Google-Extended.

How often should I update my AI crawler blocklist?

New AI crawlers emerge regularly, so review and update your blocklist quarterly at minimum. Track resources like the ai.robots.txt project on GitHub for community-maintained lists. Check server logs monthly to identify new crawlers hitting your site that aren't in your current configuration.

Can AI crawlers ignore robots.txt directives?

Yes, robots.txt is advisory rather than enforceable. Well-behaved crawlers from major companies generally respect robots.txt directives, but some crawlers ignore them. For stronger protection, implement server-level blocking via .htaccess or firewall rules, and verify legitimate crawlers using published IP address ranges.

Should I block all AI crawlers or just training crawlers?

This depends on your business priorities. Blocking training crawlers protects your content from being incorporated into AI models. Blocking search crawlers may reduce your visibility in AI-powered discovery platforms like ChatGPT search or Perplexity. Many publishers opt for selective blocking that targets training crawlers while allowing search and citation crawlers.