robots.txt Rules for GPTBot, ClaudeBot, and PerplexityBot: Why Explicit Allow Rules Matter
Reading time: 10 min
Most sites were built when "crawler" meant Googlebot. robots.txt targeted search engines. In 2026 your domain is hit by many AI user-agents — different strings, different jobs, and different defaults when rules are vague or missing.
Block the wrong bot and you vanish from AI answers. Rely on implicit permission and you leave visibility to chance. This article maps the main AI crawlers, explains why explicit Allow lines help, and lists misconfigurations that quietly hurt AI readiness.
The new crawler landscape
The list below is a practical directory — always confirm the latest user-agent strings in each vendor's documentation, because they can change.
AI CRAWLER DIRECTORY (illustrative — verify with vendors)

Crawler             User-Agent          Company     Primary role
──────────────────────────────────────────────────────────────────
GPTBot              GPTBot              OpenAI      Training / corpus
ChatGPT-User        ChatGPT-User        OpenAI      Live browsing for answers
ClaudeBot           ClaudeBot           Anthropic   Training / indexing
Claude-Web          Claude-Web          Anthropic   Live browsing
PerplexityBot       PerplexityBot       Perplexity  Search / indexing
Perplexity-User     Perplexity-User     Perplexity  Live answers
Google-Extended     Google-Extended     Google      Gemini / AI Overviews training
Gemini              Gemini              Google      Live access (verify naming)
Meta-ExternalAgent  Meta-ExternalAgent  Meta        Meta AI
cohere-ai           cohere-ai           Cohere      Training
Applebot-Extended   Applebot-Extended   Apple       Apple Intelligence
YouBot              YouBot              You.com     AI search
These agents generally respect robots.txt — but only when your rules are readable and unambiguous. They do not infer marketing intent. They match user-agent groups and path rules.
Why ambiguity is the real enemy
A legacy file might look fine to humans yet leave AI crawlers in a gray zone:
User-agent: *
Disallow: /admin/
Disallow: /checkout/
User-agent: Googlebot
Allow: /
What many AI crawlers inherit from this file:
- The wildcard group applies unless a dedicated block exists for the agent.
- Paths like / may be allowed by omission, but there is no explicit "Allow: /" for GPTBot, ClaudeBot, etc.
- Google-Extended does not use the Googlebot block; it falls through to the wildcard group.
The gap between "not explicitly disallowed" and "explicitly allowed" matters for operations teams and for tooling that scores crawl clarity. The fix: name the agents you care about and give each a clear Allow or Disallow.
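The fallback behavior above can be checked concretely with Python's standard-library robots.txt parser. This sketch feeds it the legacy file from the example; the domain is a placeholder.

```python
from urllib import robotparser

# The legacy file from above: a wildcard group plus a dedicated Googlebot group.
LEGACY = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/

User-agent: Googlebot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(LEGACY.splitlines())

# GPTBot has no dedicated group, so it inherits the wildcard rules:
# allowed by omission everywhere except the two disallowed paths.
allowed_blog = rp.can_fetch("GPTBot", "https://example.com/blog/post")     # allowed
allowed_admin = rp.can_fetch("GPTBot", "https://example.com/admin/users")  # blocked
```

Running the same check with an explicit per-agent group is how you turn "allowed by omission" into "explicitly allowed".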
Two types of AI crawling: training vs real-time
TYPE 1 — TRAINING-LIKE CRAWLERS
Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, …
Purpose: Build or refresh model-facing corpora (policy-dependent).
If blocked: Your site may be absent from those pipelines.
If allowed: Content may be eligible for inclusion per vendor policy.

TYPE 2 — REAL-TIME / RETRIEVAL
Examples: ChatGPT-User, Claude-Web, PerplexityBot, Perplexity-User, …
Purpose: Fetch pages to answer a live user query.
If blocked: That assistant may not cite or quote you in real time.
If allowed: Pages can appear in answers when retrieved.
Many publishers allow retrieval but restrict training user-agents. That is only possible if those agents are listed separately — which is why copy-paste "block GPTBot" snippets are never a substitute for a thought-through matrix.
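One way to keep such a matrix maintainable is to generate the file from explicit policy lists instead of hand-editing groups. A minimal sketch, with illustrative agent lists and paths:

```python
# Illustrative policy lists; reconcile agent names with vendor docs.
RETRIEVAL_AGENTS = ["ChatGPT-User", "Claude-Web", "PerplexityBot", "Perplexity-User"]
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]

def build_robots(allow_agents, block_agents, default_disallow=("/admin/",)):
    """Render a robots.txt that names every agent in the policy explicitly."""
    lines = []
    for agent in allow_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in block_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in default_disallow]
    return "\n".join(lines) + "\n"
```

Changing policy then means editing a list, not hunting through groups.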
Recommended explicit structure for AI visibility
Template for yourdomain.com — adjust paths you truly need to block:
# --- Search (examples) ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- OpenAI ---
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Allow: /

# --- Anthropic ---
User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

# --- Perplexity ---
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Google AI ---
User-agent: Google-Extended
Allow: /

User-agent: Gemini
Allow: /

# --- Other AI (add/remove per policy) ---
User-agent: Meta-ExternalAgent
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: YouBot
Allow: /

# --- Default ---
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /user/account/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
Vendor user-agent names evolve — treat this as a starting point and reconcile with official docs before production.
Selective opt-out: retrieval on, training off
A common publisher pattern (intent — confirm compliance with each platform's terms):
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Gemini
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /
Seven dangerous misconfigurations
These patterns show up often in audits — use them as a manual checklist alongside any automated scan.
1. Blanket site block
User-agent: *
Disallow: /
Staging leftovers in production make you invisible to every crawler, including AI.
2. Wildcard blocks on high-value paths
User-agent: *
Disallow: /blog/
Disallow: /docs/
Without per-agent overrides, unlisted AI bots inherit this — and your best explanatory content never enters the retrieval stack.
3. Confusing GPTBot and ChatGPT-User
GPTBot and ChatGPT-User are different. Blocking one does not block the other. Address both if you need full control.
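The split is easy to verify with the stdlib parser: a file that disallows only GPTBot leaves ChatGPT-User governed by the wildcard group. The domain is a placeholder.

```python
from urllib import robotparser

# Blocks the training agent only; the live-browsing agent has no group.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

gptbot_ok = rp.can_fetch("GPTBot", "https://example.com/pricing/")        # blocked
chatgpt_user_ok = rp.can_fetch("ChatGPT-User", "https://example.com/pricing/")  # allowed
```

To block both, both need their own group (or a shared Disallow in each).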
4. Blocking /api/ globally
User-agent: *
Disallow: /api/
Structured JSON feeds sometimes live under /api/. A blind block can remove the cleanest machine-readable signal you publish.
5. Path case sensitivity
User-agent names are matched case-insensitively in common parsers, but path rules are matched case-sensitively, and URLs are case-sensitive on many servers. Disallow: /Admin/ may not match /admin/.
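Python's stdlib parser reproduces this: the path comparison is case-sensitive, so a rule for /Admin/ leaves /admin/ untouched.

```python
from urllib import robotparser

# Disallows only the capitalized path.
RULES = """\
User-agent: *
Disallow: /Admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

capital_ok = rp.can_fetch("GPTBot", "https://example.com/Admin/")  # blocked
lower_ok = rp.can_fetch("GPTBot", "https://example.com/admin/")    # still allowed!
```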
6. Crawl-delay everywhere
User-agent: *
Crawl-delay: 10
Broad delays can throttle bots that need many URLs; use delays surgically for overloaded origins, not as default policy. (Note: Googlebot ignores crawl-delay; other bots may not.)
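You can see how a compliant parser reads a blanket delay with the stdlib module, which exposes the value via crawl_delay():

```python
from urllib import robotparser

# A blanket 10-second delay in the wildcard group.
RULES = """\
User-agent: *
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Any agent without its own group inherits the wildcard delay.
delay = rp.crawl_delay("PerplexityBot")  # 10
```

A bot that honors the directive will pace itself accordingly; one that ignores it (like Googlebot) will not, which is why a blanket delay is a blunt instrument.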
7. Missing Sitemap line
A Sitemap: URL speeds discovery for search and many AI crawlers. Pair it with a reachable XML file.
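Compliant parsers pick the Sitemap line up directly; in Python 3.8+ the stdlib exposes it via site_maps() (domain is a placeholder):

```python
from urllib import robotparser

RULES = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

sitemaps = rp.site_maps()  # list of declared sitemap URLs, or None if absent
```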
What LLMsRadar checks today
When we scan your site, we fetch /robots.txt, parse it for the crawl queue, and run a readiness signal: whether the file explicitly mentions major AI user-agents such as GPTBot, ClaudeBot, or Google-Extended. If not, your score reflects reduced confidence in AI-bot configuration and we recommend adding clear rules — aligned with the patterns above.
Deeper issues (wildcard traps, crawl-delay, path casing, conflicts between Allow and Disallow) still deserve a human pass using the checklist in this article and vendor documentation.
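A self-audit along the same lines fits in a few lines: scan the file for explicit mentions of major AI user-agents. The agent list here is a sample, not the scanner's actual rule set.

```python
import re

# Sample of agents worth naming explicitly; extend per your policy.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def named_ai_agents(robots_txt):
    """Return the known AI agents explicitly named in a robots.txt string."""
    found = set()
    for line in robots_txt.splitlines():
        m = re.match(r"\s*user-agent\s*:\s*(\S+)", line, re.IGNORECASE)
        if m:
            for agent in AI_AGENTS:
                if m.group(1).lower() == agent.lower():
                    found.add(agent)
    return sorted(found)
```

An empty result on your own file is the "reduced confidence" signal described above: nothing disallowed, but nothing explicitly allowed either.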
Testing before you ship
- Use Search Console's robots.txt tester (or equivalent) with custom user-agents where supported.
- Fetch the live file:
  curl https://yourdomain.com/robots.txt
- Spot-check a key URL with a bot UA:
curl -A "GPTBot" -I https://yourdomain.com/pricing/
The right mental model
Every crawler you do not address explicitly operates in a gray zone. Gray zone behavior is unpredictable — and so is your AI visibility. Prefer a guest list: name agents, set paths, then use the wildcard for everyone else.
Scan your site with LLMsRadar →
Related: What is llms.txt · llms.txt implementation guide