
robots.txt Generator: Create the Right File for Your Site in 2 Minutes
A practical guide to robots.txt: what it does, common mistakes developers make, and how to generate the right file for your website.
robots.txt is one of those files that gets created once and then forgotten. For small content sites that's usually fine. For sites with admin panels, staging paths left exposed, internal search result pages, or duplicate content, a misconfigured (or absent) robots.txt can quietly cause SEO problems for months.
This guide covers what robots.txt actually does, what it doesn't do (a common source of confusion), and how to write a correct file for common website types. You can use the robots.txt generator to build yours without memorizing syntax.
What robots.txt Actually Does
robots.txt is a plain text file you put at the root of your domain that tells web crawlers which URLs they should and shouldn't crawl. The operative word is "crawl," not "index."
The file lives at https://yourdomain.com/robots.txt. When Googlebot, Bingbot, or any other well-behaved crawler visits your site, it fetches that file before crawling anything else. Based on what it finds, it decides which URLs to crawl.
A minimal robots.txt looks like this:
User-agent: *
Allow: /
That tells all crawlers they can crawl everything. Functionally identical to having no robots.txt at all.
A restrictive one might look like:
User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
Every directive is a Disallow or Allow line, applied to one or more User-agent groups.
The Crawling vs. Indexing Distinction (This Matters)
A lot of developers assume that blocking a URL in robots.txt removes it from Google's search results. It doesn't, and this causes real problems.
Here's what actually happens when you block a URL:
- Googlebot stops crawling it: it won't read the page's content or follow its links
- But if other indexed pages link to that blocked URL, Google can still discover the URL exists
- Google may still show that URL in search results with a generic description like "No information available for this page"
So if you have a page you want completely excluded from search results, Disallow alone is not enough. You need a noindex directive, but here's the catch: if the page is blocked from crawling, Googlebot can't read the noindex tag to honor it.
This creates a frustrating situation: to use noindex, you have to let the page be crawled.
The rule of thumb:
- Use robots.txt Disallow to prevent crawlers from wasting crawl budget on pages that don't matter for SEO (staging paths, internal APIs, duplicate filtered pages)
- Use the noindex meta tag to prevent specific pages from appearing in search results
- For truly sensitive pages (admin panels, private content), use actual authentication, not robots.txt
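For reference, the noindex directive itself is just a meta tag in the page's head (or the equivalent X-Robots-Tag HTTP header):

```html
<!-- The page must NOT be blocked in robots.txt, or Googlebot
     never fetches it and never sees this directive. -->
<meta name="robots" content="noindex">
```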
User-Agent: Targeting Specific Bots
The User-agent field lets you target all crawlers or specific ones.
User-agent: *
The asterisk means "every crawler not otherwise specified." Most robots.txt files use this for general rules.
You can also target named bots:
User-agent: Googlebot
Disallow: /no-google/
User-agent: Bingbot
Disallow: /no-bing/
User-agent: *
Disallow: /admin/
Rules are applied per user-agent group. Googlebot follows the Googlebot block. If there's no matching block, it falls back to the * block.
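The group-selection logic can be sketched in a few lines. This is a simplified model (hypothetical types, not a real parser; real matching is case-insensitive and token-based), but it captures the key behavior: a crawler with its own named group obeys only that group, not the `*` rules.

```typescript
// Simplified model of robots.txt group selection.
type RuleGroup = { disallow: string[] };

// Pick the group a crawler obeys: its own named group if one exists,
// otherwise the wildcard "*" group, otherwise no restrictions.
function selectGroup(
  groups: Record<string, RuleGroup>,
  userAgent: string,
): RuleGroup {
  return groups[userAgent] ?? groups['*'] ?? { disallow: [] };
}

const groups: Record<string, RuleGroup> = {
  Googlebot: { disallow: ['/no-google/'] },
  Bingbot: { disallow: ['/no-bing/'] },
  '*': { disallow: ['/admin/'] },
};

// Googlebot obeys ONLY its own group, so /admin/ is not blocked for it
// unless you repeat the rule inside the Googlebot block.
console.log(selectGroup(groups, 'Googlebot').disallow);   // ['/no-google/']
console.log(selectGroup(groups, 'DuckDuckBot').disallow); // ['/admin/']
```

This is a common gotcha: adding a `User-agent: Googlebot` group means Googlebot stops reading the `*` group entirely.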
Some crawlers you might specifically want to manage:
- Googlebot: Google's main web crawler
- Googlebot-Image: Google's image crawler specifically
- Bingbot: Microsoft Bing
- GPTBot: OpenAI's training crawler (a newer concern for content sites)
- anthropic-ai: Anthropic's training crawler
- AhrefsBot, SemrushBot, MJ12bot: SEO and analytics tools
If you want to block AI training crawlers specifically:
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: *
Allow: /
This is increasingly common. Whether it actually matters for your site is a separate question, but the syntax is correct.
Common Disallow Patterns
Here are patterns that are correct for common situations:
Block an entire directory
User-agent: *
Disallow: /admin/
The trailing slash matters. /admin/ blocks /admin/settings, /admin/users, and so on. Without the slash, /admin would also match /administrator or /admin-login, which is probably not what you want.
Block a specific file
User-agent: *
Disallow: /private-page.html
Block URL query parameters (filter pages)
E-commerce sites often have the same products accessible via many filter combinations: /products?color=red&size=L&sort=price. These create duplicate content. Block them:
User-agent: *
Disallow: /*?
This blocks any URL with a query string. Be careful: if your important pages use query strings (some SPAs do), this will block them too. A more targeted approach:
User-agent: *
Disallow: /products/*?
This blocks query-string URLs only under /products/.
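The wildcard semantics are simple enough to model: `*` matches any sequence of characters, and a trailing `$` anchors the pattern to the end of the URL. A sketch of the translation to a regular expression (not a full parser, just the pattern semantics):

```typescript
// Translate a robots.txt path pattern ("*" = any chars, trailing "$" =
// end of URL) into a RegExp anchored at the start of the path.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    // Escape regex metacharacters, then turn "*" into ".*"
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  return new RegExp('^' + body + (anchored ? '$' : ''));
}

const rule = patternToRegExp('/products/*?');
console.log(rule.test('/products/shoes?color=red')); // true
console.log(rule.test('/products/shoes'));           // false
console.log(rule.test('/blog/post?utm=x'));          // false

const cssRule = patternToRegExp('/*.css$');
console.log(cssRule.test('/assets/main.css'));     // true
console.log(cssRule.test('/assets/main.css?v=2')); // false
```

The last pair shows why `$` matters: `/*.css$` blocks only URLs that end in `.css`, while `/*.css` would also match `/main.css?v=2`.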
Allow a specific path within a blocked directory
Sometimes you want to block most of a directory but allow one path:
User-agent: *
Disallow: /account/
Allow: /account/signup
For Googlebot and most modern crawlers, the order of lines within a user-agent block doesn't matter: the most specific (longest) matching rule wins, and when an Allow and a Disallow match with equal specificity, Allow wins.
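Google's documented precedence (longest matching rule wins; on a tie, Allow beats Disallow) can be sketched for plain prefix rules:

```typescript
// Longest-match precedence for plain prefix rules (no wildcards).
type Rule = { type: 'allow' | 'disallow'; path: string };

function isAllowed(rules: Rule[], url: string): boolean {
  let winner: Rule | null = null;
  for (const rule of rules) {
    if (!url.startsWith(rule.path)) continue;
    if (
      winner === null ||
      rule.path.length > winner.path.length ||
      (rule.path.length === winner.path.length && rule.type === 'allow')
    ) {
      winner = rule;
    }
  }
  // No matching rule means the URL is allowed by default.
  return winner === null || winner.type === 'allow';
}

const rules: Rule[] = [
  { type: 'disallow', path: '/account/' },
  { type: 'allow', path: '/account/signup' },
];

console.log(isAllowed(rules, '/account/settings')); // false
console.log(isAllowed(rules, '/account/signup'));   // true
```

`/account/signup` is allowed because the Allow rule (15 characters) is more specific than the Disallow rule (9 characters), regardless of the order the lines appear in.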
The Sitemap Directive
You can (and should) include your sitemap URL in robots.txt:
User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml
The Sitemap directive is independent of user-agent groups; by convention it goes at the end of the file. It's not just a courtesy: Google actively uses it to discover and prioritize crawling your important pages.
If you have multiple sitemaps (common for large sites with image or video sitemaps):
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml
You should also submit your sitemap directly in Google Search Console, but including it in robots.txt ensures any crawler that reads the file discovers it automatically.
Crawl-Delay: Use With Caution
Some crawlers support a Crawl-delay directive:
User-agent: *
Crawl-delay: 10
This tells bots to wait 10 seconds between requests. It was originally used to prevent crawlers from hammering small servers.
The problem: Googlebot ignores it. Google has said explicitly that they don't support Crawl-delay. If you need to reduce Googlebot's crawl rate, you do that inside Google Search Console under Crawl Rate Settings.
Other bots like Bingbot do support Crawl-delay, so it's not completely useless, but for Google specifically it's a no-op. Don't rely on it for rate limiting.
Common Mistakes That Hurt SEO
Blocking CSS and JavaScript files
This used to be common advice ("block bots from crawling scripts and stylesheets to save bandwidth"), and it's now bad practice. Google needs to render your pages to evaluate them properly, and rendering requires CSS and JavaScript. If you block these:
# Don't do this
User-agent: *
Disallow: /wp-content/
Disallow: /assets/
Disallow: /*.css$
Disallow: /*.js$
Google can't see what your pages actually look like, which can hurt rankings. Allow crawlers to access your static assets.
Forgetting about staging and development environments
If your staging environment is accessible at staging.yourdomain.com or yourdomain.com/staging/, make sure it's either:
- Behind authentication (preferred)
- Blocked in robots.txt
Staging content indexed by Google creates duplicate content issues. This is a common oversight on sites that move fast.
Using robots.txt to protect sensitive content
This one is worth repeating. robots.txt is not access control. The file is publicly readable: anyone can visit yourdomain.com/robots.txt and see exactly which paths you've tried to hide. If those paths are interesting, people will visit them directly. Use server-side authentication for anything that's actually sensitive.
Blocking your own sitemap accidentally
# This accidentally blocks the sitemap
User-agent: *
Disallow: /
Allow: /public/
Sitemap: https://yourdomain.com/sitemap.xml
In this example, /sitemap.xml is blocked by Disallow: / because it doesn't match the Allow: /public/ exception. The sitemap directive in the file doesn't automatically unblock the sitemap URL. You'd need to add Allow: /sitemap.xml explicitly.
Wrong location
robots.txt must live at the domain root. yourdomain.com/robots.txt works. yourdomain.com/blog/robots.txt does nothing; crawlers won't look there.
Also, a robots.txt at yourdomain.com does not affect subdomain.yourdomain.com. Each subdomain needs its own robots.txt if you want to control crawling there.
robots.txt Examples for Common Site Types
Standard content or marketing site
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
If there's nothing to hide and no duplicate content problem, keep it simple.
WordPress site
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml
The Allow: /wp-admin/admin-ajax.php line is important: some themes and plugins use this endpoint for front-end functionality that Googlebot needs to access for correct rendering.
E-commerce site with filter pages
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /products/*?sort=
Disallow: /products/*?page=
Sitemap: https://yourdomain.com/sitemap.xml
Cart and checkout pages should never be indexed. Account pages are user-specific content. Filter/sort URL combinations create duplicate content.
Next.js or SPA with API routes
User-agent: *
Disallow: /api/
Disallow: /_next/
Allow: /_next/static/
Allow: /_next/image
Sitemap: https://yourdomain.com/sitemap.xml
Block API routes. Block Next.js internal paths except for the static assets and the image optimization endpoint, which Googlebot needs in order to render your pages.
Testing Your robots.txt
Don't assume your file is correct; verify it.
Google Search Console has a robots.txt report (Settings > robots.txt). You enter a URL path and it tells you whether Googlebot is allowed or blocked based on your current file. It also shows the version of the file Google last fetched.
URL Inspection tool in Search Console lets you check individual URLs and see whether Google can access and render them.
Third-party validators in various SEO suites can check syntax and test multiple user agents. The SEO meta generator and OG image generator are useful companions when you're working on your site's overall SEO setup.
After making changes to robots.txt, you can ask Google to re-fetch it using the URL Inspection tool, though changes typically propagate within a few days anyway.
Generating Your robots.txt
Writing robots.txt by hand isn't complicated, but it's easy to get the syntax slightly wrong: a missing slash, a directive in the wrong place, a user-agent block that doesn't cover what you thought.
The robots.txt generator lets you select which bots to configure, which paths to block, add your sitemap URL, and get a clean, valid output. It's faster than looking up the syntax, and it reduces the chance of a typo that costs you crawl coverage on pages you care about.
Once generated, place the file at the root of your web server and verify it's accessible at yourdomain.com/robots.txt. For most frameworks:
- Next.js: add public/robots.txt, or use the robots.ts route in the App Router
- Gatsby: place the file in static/robots.txt
- Hugo: place the file in static/robots.txt
- Plain HTML sites: place the file in the root directory served by your web server
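Whichever route you take, the generation step itself is just string serialization. A minimal sketch of what a generator produces (the `RobotsConfig` shape here is hypothetical, not any particular tool's API):

```typescript
// Minimal robots.txt serializer: user-agent groups plus sitemap lines.
interface RobotsConfig {
  userAgent: string;
  allow?: string[];
  disallow?: string[];
  sitemap?: string;
}

function renderRobotsTxt(configs: RobotsConfig[]): string {
  const groups = configs.map((c) =>
    [
      `User-agent: ${c.userAgent}`,
      ...(c.disallow ?? []).map((p) => `Disallow: ${p}`),
      ...(c.allow ?? []).map((p) => `Allow: ${p}`),
    ].join('\n'),
  );
  // Sitemap directives are group-independent, so they go at the end.
  const sitemaps = configs
    .filter((c) => c.sitemap)
    .map((c) => `Sitemap: ${c.sitemap}`);
  return [...groups, ...sitemaps].join('\n\n') + '\n';
}

console.log(
  renderRobotsTxt([
    {
      userAgent: '*',
      disallow: ['/admin/'],
      sitemap: 'https://yourdomain.com/sitemap.xml',
    },
  ]),
);
```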
For Next.js App Router users, you can also generate it programmatically:
// app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/admin/', '/api/'],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
This approach is preferable for sites where the rules might depend on environment variables (e.g., blocking all crawlers on staging).
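A sketch of that environment-dependent pattern, with the Next.js specifics stripped out so the logic stands alone (the production flag is an assumption; in a real app this would live inside app/robots.ts and check your deploy environment):

```typescript
// Environment-dependent robots rules: permissive in production,
// block everything on staging so it never gets indexed.
function robotsRules(isProduction: boolean): {
  userAgent: string;
  allow?: string;
  disallow: string[];
} {
  return isProduction
    ? { userAgent: '*', allow: '/', disallow: ['/admin/', '/api/'] }
    : { userAgent: '*', disallow: ['/'] };
}

console.log(robotsRules(process.env.NODE_ENV === 'production'));
```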
robots.txt is simple, but the misconceptions around it (especially the crawling vs. indexing distinction) cause real, sometimes hard-to-debug SEO problems. Get the file right once, test it in Search Console, and then you can actually forget about it.