robots.txt
Complete guide to robots.txt: syntax, directives, per-AI-crawler rules, and best practices for 2026.
2025-02-15 · Updated 2026-02-20
The robots.txt file is a plain-text file placed at the root of your domain (https://example.com/robots.txt). It tells web crawlers, both traditional search engines and AI agents, which paths they may access; well-behaved crawlers fetch it before requesting any other URL.
Location and format
- URL: https://yourdomain.com/robots.txt
- Format: plain text, UTF-8 encoded
- Case-sensitive: Paths are case-sensitive on Unix/Linux servers
Basic syntax
User-agent: GPTBot
Disallow: /private/
Allow: /
User-agent: *
Disallow: /admin/
- `User-agent`: The crawler this rule block applies to. `*` matches all crawlers.
- `Disallow`: Paths the crawler must not access (recursive: `/blog/` blocks all URLs starting with `/blog/`).
- `Allow`: Explicitly permits a path, overriding a broader `Disallow`.
- `Crawl-delay: N`: Requests an N-second pause between requests (not all crawlers support this).
- `Sitemap: URL`: Points to your sitemap.
Order within a group does not matter to modern crawlers: per RFC 9309, the rule with the longest matching path wins, and when an Allow and a Disallow match equally, the less restrictive Allow applies.
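To make that longest-match behavior concrete, here is a minimal Python sketch of the precedence logic. It is an illustration of RFC 9309 matching, not a full parser; the rules and paths are hypothetical, and wildcards are ignored here (see the next section).

```python
# Minimal sketch of RFC 9309 precedence: among the rules in the matched
# group, the longest matching path wins; on a tie, Allow beats Disallow.
rules = [("allow", "/"), ("disallow", "/private/"), ("allow", "/private/press/")]

def is_allowed(path: str) -> bool:
    matches = [(len(p), kind) for kind, p in rules if path.startswith(p)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    length, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

print(is_allowed("/private/press/launch"))  # True  (longest match is the Allow)
print(is_allowed("/private/notes"))         # False (longest match is the Disallow)
```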
Wildcard patterns
Most crawlers support two wildcards:
- `*`: matches any sequence of characters
- `$`: anchors the match to the end of the URL
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block URLs with ?session= parameter
User-agent: *
Disallow: /*?session=
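These patterns map cleanly onto regular expressions, so a small translation helper can show exactly which URLs a pattern catches. This is an illustrative sketch: `*` becomes `.*`, a trailing `$` becomes an end anchor, and everything else is matched literally.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an equivalent regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/docs/guide.pdf")))        # True: blocked
print(bool(pdf_rule.match("/docs/guide.pdf?x=1")))    # False: $ anchors the end

session_rule = robots_pattern_to_regex("/*?session=")
print(bool(session_rule.match("/cart?session=abc")))  # True: blocked
```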
AI crawler user-agents reference
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | ChatGPT browsing, training |
| OAI-SearchBot | OpenAI | ChatGPT search |
| ChatGPT-User | OpenAI | ChatGPT user-triggered browsing |
| ClaudeBot | Anthropic | Claude training and browsing |
| anthropic-ai | Anthropic | Anthropic general |
| PerplexityBot | Perplexity AI | Perplexity search |
| Google-Extended | Google | Gemini training (separate from Googlebot) |
| Applebot-Extended | Apple | Apple Intelligence |
| Meta-ExternalAgent | Meta | Meta AI |
| Bytespider | ByteDance | TikTok / training |
| CCBot | Common Crawl | Open dataset (used by many AI orgs) |
| Diffbot | Diffbot | Data extraction for AI |
Blocking all AI crawlers
To block all major AI crawlers while keeping traditional search engines:
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Note: Blocking Google-Extended has no impact on traditional Google Search (Googlebot). They are independent.
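Hand-maintaining a dozen near-identical groups invites typos. A short script can generate the block from a list of agent names; this is a convenience sketch using the user-agents from the table above.

```python
# Convenience sketch: emit a Disallow-everything group per AI user-agent.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent", "CCBot", "Bytespider",
]

blocks = "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in AI_AGENTS)
print(blocks)  # paste the output into robots.txt
```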
Allowing some, blocking others
Allow reputable AI agents while blocking data aggregators:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Blocking specific sections only
Allow crawling in general but protect members-only or transactional areas. The longer Disallow paths win over the broad `Allow: /` because the most specific match takes precedence:
User-agent: GPTBot
Disallow: /members/
Disallow: /checkout/
Disallow: /api/private/
Allow: /
Complete recommended example
# Traditional search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI agents — allowed
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
# AI agents — blocked (training data aggregators)
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
# All other crawlers
User-agent: *
Allow: /
# Sitemap
Sitemap: https://example.com/sitemap.xml
Common mistakes
- Blocking Googlebot accidentally: using `User-agent: *` with `Disallow: /` blocks Google Search. Always add explicit rules for Googlebot if you use a wildcard block.
- Forgetting Common Crawl (CCBot): CCBot feeds open datasets used by many AI organizations for training. Block it if you want to opt out of open training corpora.
- Missing the Sitemap reference: always add `Sitemap:` at the end.
- Thinking Disallow provides security: robots.txt is public and is a convention, not a technical barrier. Malicious bots ignore it. Use proper authentication for sensitive content.
- Blocking CSS and JS: this prevents crawlers from fully rendering your pages.
Does blocking AI crawlers affect SEO?
No. Google-Extended, GPTBot, and ClaudeBot are entirely separate from Googlebot; blocking them has no impact on Google Search rankings.
Verification
- Visit https://yourdomain.com/robots.txt directly and confirm it is accessible and correctly formatted.
- Use the robots.txt report in Google Search Console (the successor to the robots.txt Tester).
- Check server logs after 24–48 hours to verify crawlers are respecting the rules.
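You can also test rules programmatically. Python's standard-library parser handles plain path prefixes but does not implement the `*` and `$` wildcards, so use it only for simple rules. The domain and paths below are placeholders; substitute your own.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the live file

for agent, url in [
    ("GPTBot", "https://example.com/members/profile"),
    ("GPTBot", "https://example.com/blog/post"),
    ("CCBot", "https://example.com/"),
]:
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(agent, url, "->", verdict)
```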
Complement with Content Signals
robots.txt controls access (can you crawl this?). For expressing what AI systems can do with your content after accessing it, use Content Signals headers alongside robots.txt.
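The syntax for content signals is still settling. One emerging convention, Cloudflare's Content Signals Policy, embeds the signal directly in robots.txt as a `Content-Signal` line; treat the following as a sketch of that proposal rather than a settled standard:

```
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
```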