Robots.txt Test
What is it?
The robots.txt file is a small plain-text file that lives at the root of your domain (typically https://example.com/robots.txt) and tells crawlers which paths on your site they may or may not access. It is the first resource most well-behaved crawlers request when they discover a new site, and the directives it contains shape how search engines, advertising bots, and AI training crawlers interact with your content.
Why robots.txt still matters in 2026
Robots.txt is your front door to the bot ecosystem. A well-tuned file keeps administrative pages, internal search results, and other low-value URLs out of search engine crawl budgets, ensuring that crawlers spend their time on the pages that actually deserve to be indexed. A misconfigured file, on the other hand, can quietly block your most important content from being crawled at all, often without anyone noticing until traffic drops weeks later.
The file also doubles as a public discovery mechanism for your XML sitemap. Adding a Sitemap: line to robots.txt ensures every crawler that reads the file also knows where to find your full URL inventory, which speeds up discovery of new and updated pages.
The newer reason: AI crawlers
The same robots.txt now governs the AI training and answer-engine crawlers operated by OpenAI (GPTBot), Anthropic (ClaudeBot), Perplexity (PerplexityBot), Google (the Google-Extended token, which controls whether content may be used for Gemini, formerly Bard), and others. Their operators document that these crawlers honor robots.txt directives, so the file is now the central place where you decide whether your content is eligible for inclusion in AI training datasets and citation in generative answers. Allowing them keeps your site visible in AI-driven search experiences; blocking them signals that you do not want the content surfaced or paraphrased without attribution.
Common mistakes worth checking
- Missing file, leaving every crawler to apply its own defaults.
- File returning a 5xx error, which causes some crawlers to stop crawling the entire site until the file is reachable again.
- Accidental site-wide block from a stray `Disallow: /` left over after a staging deployment.
- Blocking CSS or JavaScript, which prevents Google from rendering the page as users see it.
This test verifies that your site exposes a valid robots.txt at the root. The fix guide below walks through creating and editing the file directly, through the major content management systems, and at the CDN edge.
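The mistakes listed above can be approximated in a few lines of Python. This is a minimal sketch, not the checker this test actually runs: the function name `classify_robots` and its return labels are hypothetical, and it treats "the homepage is disallowed for the wildcard agent" as a stand-in for a site-wide block.

```python
from urllib.robotparser import RobotFileParser


def classify_robots(status: int, body: str) -> str:
    """Map a robots.txt HTTP response to one of the common mistakes.

    `status` is the HTTP status code returned for /robots.txt and
    `body` is the response text. Labels are illustrative only.
    """
    if 500 <= status < 600:
        # Some crawlers pause crawling the whole site on a 5xx.
        return "server-error"
    if status == 404:
        # No file: every crawler falls back to its own defaults.
        return "missing"
    if status != 200:
        return "unexpected-status"

    parser = RobotFileParser()
    parser.parse(body.splitlines())
    # Heuristic: if a generic crawler may not fetch "/", assume a
    # stray "Disallow: /" is blocking the entire site.
    if not parser.can_fetch("*", "/"):
        return "blocks-entire-site"
    return "ok"
```

Fetching the file itself is left out so the function stays easy to test; in practice the status and body would come from whatever HTTP client you already use.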
Pass rate:
- Top 100 websites: 99%. This value indicates the percentage of the top 100 most visited websites in the US that passed this test in the past 12 months.
- All websites: 86%. This value indicates the percentage of all websites analyzed in SEO Site Checkup (500,000+) that passed this test in the past 12 months.
| Year | Pass rate |
|---|---|
| 2021 | 94% |
| 2022 | 99% |
| 2023 | 99% |
| 2024 | 99% |
How do I fix it?
The robots.txt file is the first resource most crawlers request and tells them which paths on your site they may or may not access. Fixing this issue means publishing a valid robots.txt at the site root so search engines, training crawlers, and answer-engine bots all have an explicit policy to follow. Without it, every crawler defaults to its own behavior, which can over-expose private sections or waste crawl budget on low-value URLs.
Example
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /
Sitemap: https://example.com/sitemap.xml
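One way to sanity-check a file like the example before publishing it is to parse it with Python's standard `urllib.robotparser` and ask which paths a generic crawler may fetch. The sketch below is illustrative: the helper `allowed` and the sample paths are assumptions, and Python's parser applies the first matching rule, whereas Google and RFC 9309 use the longest match. For this particular file both approaches agree.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, embedded as a string.
EXAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", "https://example.com" + path)


# Rules are prefix matches, so "/cart" also covers "/cartoons".
for path in ["/admin/users", "/cart", "/cartoons", "/products/widget"]:
    print(path, "allowed" if allowed(path) else "blocked")

# site_maps() (Python 3.8+) lists the Sitemap: URLs found in the file.
print(parser.site_maps())
```

Because `Disallow: /cart` has no trailing slash, it matches any path that starts with `/cart`; add the slash if only the cart directory should be excluded.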
Where to make the change
- Raw HTML or static site: create a plain text file named `robots.txt` and upload it to the document root so it is served at `https://example.com/robots.txt`.
- WordPress: WordPress generates a virtual robots.txt by default. Override it with a physical file in the site root, or use your SEO plugin's robots.txt editor.
- Shopify: Shopify auto-generates robots.txt and exposes a `robots.txt.liquid` template you can edit from the theme code editor to add or remove rules.
- Wix or Squarespace: both platforms generate robots.txt automatically and offer limited overrides through the SEO settings panel.
- Headless or framework sites: add `robots.txt` to the project's public or static directory so it ships with the build.
Common causes and how to resolve them
- File is missing entirely: create one. Even a minimal file containing only `User-agent: *` and no rules is preferable to a 404.
- File returns a 5xx error: some crawlers stop crawling the entire site when robots.txt is unreachable. Investigate server or CDN issues that block the path.
- Accidentally blocking the whole site: a stray `Disallow: /` under `User-agent: *` tells every crawler to stay away from every URL. Audit the file before deploying changes from staging.
- Wrong location: robots.txt must sit at the site root, not in a subdirectory. Crawlers do not look elsewhere.
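To make the "wrong location" point concrete, the sketch below derives the only URL a crawler will request for robots.txt from any page URL on the site. The function name `robots_url` is hypothetical; the logic simply keeps the scheme, host, and port and discards the rest.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host:port; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://shop.example.com:8443/a/b"))
```

Each scheme, host, and port combination has its own robots.txt, so a file published on `example.com` does not cover `shop.example.com`.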
Best practices
- Reference the sitemap: add a `Sitemap:` line so any crawler that fetches robots.txt also discovers your XML sitemap.
- Keep it simple: a short, well-commented file is easier to audit than a sprawling one full of legacy rules. Remove directives whose original purpose is no longer relevant.
- Do not rely on robots.txt for security: the file is public, so listing private paths in it advertises them. Block sensitive paths with authentication, not just `Disallow`.
- Manage AI crawlers explicitly: if you want to allow or block bots like `GPTBot`, `ClaudeBot`, or `PerplexityBot`, add named `User-agent` sections for each. Defaulting to allow keeps your content eligible for citation in AI answer engines.
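As an illustration of named `User-agent` sections, the sketch below parses a hypothetical policy that blocks one AI crawler and allows two others, then reports what each bot may fetch. The policy text, the `ai_access` helper, and the choice of which bot to block are assumptions made for the example, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: one named group per decision, plus a default.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /admin/
"""


def ai_access(robots_txt, bots=("GPTBot", "ClaudeBot", "PerplexityBot"), url="/"):
    """Return {bot: may_fetch} for the given robots.txt text and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


print(ai_access(POLICY))
# A bot with its own group does not inherit the "*" rules:
print(ai_access(POLICY, bots=("ClaudeBot", "Googlebot"), url="/admin/"))
```

Note the second call: a crawler follows only the group that matches it most specifically, so any path you block under `User-agent: *` must be repeated in each named group where it should also apply.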