robots.txt Rules for GPTBot, ClaudeBot, and PerplexityBot: Why Explicit Allow Rules Matter

Reading time: 10 min

Most sites were built when "crawler" meant Googlebot. robots.txt targeted search engines. In 2026 your domain is hit by many AI user-agents — different strings, different jobs, and different defaults when rules are vague or missing.

Block the wrong bot and you vanish from AI answers. Rely on implicit permission and you leave visibility to chance. This article maps the main AI crawlers, explains why explicit Allow lines help, and lists misconfigurations that quietly hurt AI readiness.


The new crawler landscape

The list below is a practical directory — always confirm the latest user-agent strings in each vendor's documentation, because they can change.

AI CRAWLER DIRECTORY (illustrative — verify with vendors)

Crawler           User-Agent           Company      Primary role
──────────────────────────────────────────────────────────────────
GPTBot            GPTBot               OpenAI       Training / corpus
ChatGPT-User      ChatGPT-User         OpenAI       Live browsing for answers
ClaudeBot         ClaudeBot            Anthropic    Training / indexing
Claude-Web        Claude-Web           Anthropic    Live browsing
PerplexityBot     PerplexityBot        Perplexity   Search / indexing
Perplexity-User   Perplexity-User      Perplexity   Live answers
Google-Extended   Google-Extended      Google       Gemini / AI Overviews training use
Gemini            Gemini               Google       Live access (verify naming)
Meta-ExternalAgent Meta-ExternalAgent  Meta         Meta AI
cohere-ai         cohere-ai            Cohere       Training
Applebot-Extended Applebot-Extended    Apple        Apple Intelligence
YouBot            YouBot               You.com      AI search

These agents generally respect robots.txt — but only when your rules are readable and unambiguous. They do not infer marketing intent. They match user-agent groups and path rules.

Why ambiguity is the real enemy

A legacy file might look fine to humans yet leave AI crawlers in a gray zone:

User-agent: *
Disallow: /admin/
Disallow: /checkout/

User-agent: Googlebot
Allow: /

What many AI crawlers inherit:
  └── Wildcard group applies unless a dedicated block exists.
      Paths like / may be allowed by omission — but there is no
      explicit "Allow: /" for GPTBot, ClaudeBot, etc.

Google-Extended does not use the Googlebot block — it falls through.

The gap between "not explicitly disallowed" and "explicitly allowed" matters for operations teams and for tooling that scores crawl clarity. The fix: name the agents you care about and give each a clear Allow or Disallow.
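Python's standard-library urllib.robotparser makes the fallthrough concrete. This is one parser's reading of the legacy file above, not a guarantee of how any vendor's crawler behaves, and example.com is a placeholder:

```python
import urllib.robotparser

# The legacy file from above: a wildcard group plus a Googlebot group.
LEGACY = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(LEGACY.splitlines())

# GPTBot has no dedicated group, so it inherits the wildcard rules:
# allowed on /blog/ only by omission, blocked on /admin/.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))   # True
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))      # False

# Google-Extended does not match the Googlebot group either;
# it also falls through to the wildcard.
print(rp.can_fetch("Google-Extended", "https://example.com/blog/post"))  # True
```

Both agents end up allowed, but only implicitly: nothing in the file names them, which is exactly the gray zone described above.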

Two types of AI crawling: training vs real-time

TYPE 1 — TRAINING-LIKE CRAWLERS
  Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, …
  Purpose:  Build or refresh model-facing corpora (policy-dependent).
  If blocked:  Your site may be absent from those pipelines.
  If allowed:  Content may be eligible for inclusion per vendor policy.

TYPE 2 — REAL-TIME / RETRIEVAL
  Examples: ChatGPT-User, Claude-Web, PerplexityBot, Perplexity-User, …
  Purpose:  Fetch pages to answer a live user query.
  If blocked:  That assistant may not cite or quote you in real time.
  If allowed:  Pages can appear in answers when retrieved.

Many publishers allow retrieval but restrict training user-agents. That is only possible if those agents are listed separately — which is why copy-paste "block GPTBot" snippets are never a substitute for a thought-through matrix.

Recommended explicit structure for AI visibility

Template for yourdomain.com — adjust paths you truly need to block:

# --- Search (examples) ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- OpenAI ---
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Allow: /

# --- Anthropic ---
User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

# --- Perplexity ---
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- Google AI ---
User-agent: Google-Extended
Allow: /

User-agent: Gemini
Allow: /

# --- Other AI (add/remove per policy) ---
User-agent: Meta-ExternalAgent
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: YouBot
Allow: /

# --- Default ---
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /user/account/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Vendor user-agent names evolve — treat this as a starting point and reconcile with official docs before production.

Selective opt-out: retrieval on, training off

A common publisher pattern (intent — confirm compliance with each platform's terms):

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Gemini
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /
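The split can be sanity-checked the same way with urllib.robotparser. This sketch uses a trimmed two-agent version of the policy above; example.com is a placeholder:

```python
import urllib.robotparser

# Trimmed version of the selective policy: retrieval on, training off.
POLICY = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())

# The retrieval agent can fetch; the training agent cannot.
print(rp.can_fetch("ChatGPT-User", "https://example.com/pricing/"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/pricing/"))        # False
```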

Seven dangerous misconfigurations

These patterns show up often in audits — use them as a manual checklist alongside any automated scan.

1. Blanket site block

User-agent: *
Disallow: /

A staging leftover shipped to production makes you invisible to every crawler, AI bots included.

2. Wildcard blocks on high-value paths

User-agent: *
Disallow: /blog/
Disallow: /docs/

Without per-agent overrides, unlisted AI bots inherit this — and your best explanatory content never enters the retrieval stack.

3. Confusing GPTBot and ChatGPT-User

GPTBot and ChatGPT-User are different. Blocking one does not block the other. Address both if you need full control.

4. Blocking /api/ globally

User-agent: *
Disallow: /api/

Structured JSON feeds sometimes live under /api/. A blind block can remove the cleanest machine-readable signal you publish.

5. Path case sensitivity

User-agent names are matched case-insensitively by common parsers, but path rules are matched case-sensitively (per RFC 9309), and URL paths are case-sensitive on many servers anyway. Disallow: /Admin/ does not match /admin/.
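The stdlib parser demonstrates the casing trap directly (one implementation's behavior; compliant crawlers should match case-sensitively as well):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow: /Admin/".splitlines())

# Path matching is case-sensitive: /Admin/ is blocked, /admin/ slips through.
print(rp.can_fetch("GPTBot", "https://example.com/Admin/"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/admin/"))  # True
```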

6. Crawl-delay everywhere

User-agent: *
Crawl-delay: 10

Broad delays can throttle bots that need many URLs; use delays surgically for overloaded origins, not as default policy. (Note: Googlebot ignores crawl-delay; other bots may not.)

7. Missing Sitemap line

A Sitemap: URL speeds discovery for search and many AI crawlers. Pair it with a reachable XML file.
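If you script your checks in Python, urllib.robotparser (3.8+) also exposes the Sitemap lines, which is handy in CI; the URL here is a placeholder:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
""".splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```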

What LLMsRadar checks today

When we scan your site, we fetch /robots.txt, parse it for the crawl queue, and run a readiness signal: whether the file explicitly mentions major AI user-agents such as GPTBot, ClaudeBot, or Google-Extended. If not, your score reflects reduced confidence in AI-bot configuration and we recommend adding clear rules — aligned with the patterns above.

Deeper issues (wildcard traps, crawl-delay, path casing, conflicts between Allow and Disallow) still deserve a human pass using the checklist in this article and vendor documentation.
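A minimal version of that mention check is easy to script yourself. This sketch is ours, not LLMsRadar's implementation: the agent list and helper name are illustrative, and it only looks for exact User-agent group names:

```python
# Hypothetical helper: which major AI user-agents does this file name?
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
             "PerplexityBot", "Perplexity-User", "Google-Extended"]

def mentioned_agents(robots_txt: str) -> list[str]:
    # Collect every User-agent value, lowercased for comparison.
    groups = {line.split(":", 1)[1].strip().lower()
              for line in robots_txt.splitlines()
              if line.strip().lower().startswith("user-agent:")}
    return [a for a in AI_AGENTS if a.lower() in groups]

sample = "User-agent: GPTBot\nAllow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(mentioned_agents(sample))  # ['GPTBot']
```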

Testing before you ship

  • Use Search Console's robots.txt report (Google retired the standalone tester in 2023) or a third-party tester that supports custom user-agents.
  • Fetch the live file: curl https://yourdomain.com/robots.txt
  • Spot-check a key URL with a bot UA: curl -A "GPTBot" -I https://yourdomain.com/pricing/

The right mental model

Every crawler you do not address explicitly operates in a gray zone. Gray zone behavior is unpredictable — and so is your AI visibility. Prefer a guest list: name agents, set paths, then use the wildcard for everyone else.

Scan your site with LLMsRadar →

Related: What is llms.txt · llms.txt implementation guide

Tags: robots.txt, GPTBot, ClaudeBot, PerplexityBot, AI crawlers, AI SEO, LLM visibility, AI readiness 2026
