ToolPal

robots.txt Generator: Create the Right File for Your Site in 2 Minutes



A practical guide to robots.txt — what it does, common mistakes developers make, and how to generate the right file for your website.

April 7, 2026 · 10 min read

robots.txt is one of those files that gets created once and then forgotten. For small content sites that's usually fine. For sites with admin panels, staging paths left exposed, internal search result pages, or duplicate content, a misconfigured (or absent) robots.txt can quietly cause SEO problems for months.

This guide covers what robots.txt actually does, what it doesn't do (a common source of confusion), and how to write a correct file for common website types. You can use the robots.txt generator to build yours without memorizing syntax.

What robots.txt Actually Does

robots.txt is a plain text file you put at the root of your domain that tells web crawlers which URLs they should and shouldn't crawl. The operative word is "crawl" — not "index."

The file lives at https://yourdomain.com/robots.txt. When Googlebot, Bingbot, or any other well-behaved crawler visits your site, it fetches that file before crawling and uses the rules it finds to decide which URLs to crawl.

A minimal robots.txt looks like this:

User-agent: *
Allow: /

That tells all crawlers they can crawl everything. Functionally identical to having no robots.txt at all.

A restrictive one might look like:

User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/

Each rule is either a Disallow or an Allow line, applied to one or more User-agent groups; a few directives like Sitemap stand on their own, outside any group.

The Crawling vs. Indexing Distinction (This Matters)

A lot of developers assume that blocking a URL in robots.txt removes it from Google's search results. It doesn't, and this causes real problems.

Here's what actually happens when you block a URL:

  • Googlebot stops crawling it — it won't follow links from that page and won't read its content
  • But if other indexed pages link to that blocked URL, Google can still discover that the URL exists
  • Google may still show that URL in search results with a generic description like "No information available for this page"

So if you have a page you want completely excluded from search results, Disallow alone is not enough. You need a noindex directive — but here's the catch: if the page is blocked from crawling, Googlebot can't read the noindex tag to honor it.

This creates a frustrating situation: to use noindex, you have to let the page be crawled.
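The noindex signal itself lives on the page, not in robots.txt. A minimal example of the meta tag (the page must stay crawlable for Googlebot to see it):

```html
<!-- In the page's <head>: allow crawling, but keep the page out of search results -->
<meta name="robots" content="noindex">
```

For non-HTML resources like PDFs, the same signal can be sent as an X-Robots-Tag: noindex HTTP response header.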

The rule of thumb:

  • Use robots.txt Disallow to prevent crawlers from wasting crawl budget on pages that don't matter for SEO (staging paths, internal APIs, duplicate filtered pages)
  • Use the noindex meta tag to prevent specific pages from appearing in search results
  • For truly sensitive pages (admin panels, private content), use actual authentication — not robots.txt

User-Agent: Targeting Specific Bots

The User-agent field lets you target all crawlers or specific ones.

User-agent: *

The asterisk means "every crawler not otherwise specified." Most robots.txt files use this for general rules.

You can also target named bots:

User-agent: Googlebot
Disallow: /no-google/

User-agent: Bingbot
Disallow: /no-bing/

User-agent: *
Disallow: /admin/

Rules are applied per user-agent group. Googlebot follows the Googlebot block. If there's no matching block, it falls back to the * block.
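The group-selection step can be sketched like this (an illustrative simplification: it treats the user-agent token as a name prefix and falls back to the * group):

```typescript
// Sketch of how a crawler picks its user-agent group: the most specific
// token that prefixes its name wins; '*' is the fallback.
function selectGroup(
  agentName: string,
  groups: Map<string, string[]> // user-agent token -> Disallow paths
): string[] {
  let bestToken = '';
  for (const token of groups.keys()) {
    if (token === '*') continue;
    if (
      agentName.toLowerCase().startsWith(token.toLowerCase()) &&
      token.length > bestToken.length
    ) {
      bestToken = token;
    }
  }
  if (bestToken) return groups.get(bestToken)!;
  return groups.get('*') ?? [];
}

const groups = new Map<string, string[]>([
  ['Googlebot', ['/no-google/']],
  ['Bingbot', ['/no-bing/']],
  ['*', ['/admin/']],
]);
console.log(selectGroup('Googlebot', groups));   // ['/no-google/'] (not /admin/)
console.log(selectGroup('DuckDuckBot', groups)); // falls back to ['/admin/']
```

Note that once a named block matches, the * block is ignored entirely for that crawler, which is why Googlebot above is not subject to Disallow: /admin/.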

Some crawlers you might specifically want to manage:

  • Googlebot — Google's main web crawler
  • Googlebot-Image — Google's image crawler specifically
  • Bingbot — Microsoft Bing
  • GPTBot — OpenAI's training crawler (a newer concern for content sites)
  • anthropic-ai — Anthropic's training crawler (Anthropic's live web crawler also identifies as ClaudeBot)
  • AhrefsBot, SemrushBot, MJ12bot — SEO and analytics tools

If you want to block AI training crawlers specifically:

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Allow: /

This is increasingly common. Whether it actually matters for your site is a separate question, but the syntax is correct.

Common Disallow Patterns

Here are patterns that are correct for common situations:

Block an entire directory

User-agent: *
Disallow: /admin/

The trailing slash matters. /admin/ blocks /admin/settings, /admin/users, etc. Without the slash, /admin would also block a page literally called /administrator or /admin-login — probably not what you want.

Block a specific file

User-agent: *
Disallow: /private-page.html

Block URL query parameters (filter pages)

E-commerce sites often have the same products accessible via many filter combinations: /products?color=red&size=L&sort=price. These create duplicate content. Block them:

User-agent: *
Disallow: /*?

This blocks any URL with a query string. Be careful — if your important pages use query strings (some SPAs do), this will block them too. A more targeted approach:

User-agent: *
Disallow: /products/*?

This blocks query-string URLs only under /products/.
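Under the hood, a * wildcard matches any run of characters and a trailing $ anchors the end of the URL. One way to reason about a pattern is to translate it into a regular expression; this is an illustrative sketch, not a spec-complete matcher:

```typescript
// Translate a robots.txt path pattern (supporting * and a trailing $)
// into a RegExp, so URLs can be tested against it.
function patternToRegExp(pattern: string): RegExp {
  // A trailing $ anchors the match at the end of the URL
  let anchorEnd = false;
  if (pattern.endsWith('$')) {
    anchorEnd = true;
    pattern = pattern.slice(0, -1);
  }
  // Escape regex metacharacters in the literal parts, turn * into .*
  const body = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + body + (anchorEnd ? '$' : ''));
}

console.log(patternToRegExp('/products/*?').test('/products/shoes?sort=price')); // true
console.log(patternToRegExp('/products/*?').test('/products/shoes'));            // false
console.log(patternToRegExp('/*.css$').test('/assets/site.css'));                // true
console.log(patternToRegExp('/*.css$').test('/assets/site.css?v=2'));            // false
```

The last two lines show why the $ anchor matters: /*.css$ blocks stylesheet URLs but not a .css path followed by a query string.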

Allow a specific path within a blocked directory

Sometimes you want to block most of a directory but allow one path:

User-agent: *
Disallow: /account/
Allow: /account/signup

For Googlebot and most modern crawlers, order within a group doesn't actually matter: the most specific (longest) matching rule wins, and Allow wins ties. Some older crawlers do apply rules in file order, so keeping the Allow exception next to its Disallow is still a good habit.
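That longest-match behavior can be sketched in a few lines of TypeScript (a simplified illustration that handles literal path prefixes only and ignores * and $ wildcards):

```typescript
// Pick the most specific (longest) matching rule; Allow wins ties.
type Rule = { type: 'allow' | 'disallow'; path: string };

function isAllowed(url: string, rules: Rule[]): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!url.startsWith(rule.path)) continue;
    if (best === null || rule.path.length > best.path.length) {
      best = rule;
    } else if (rule.path.length === best.path.length && rule.type === 'allow') {
      best = rule; // on a tie, Allow wins (Google's documented tie-break)
    }
  }
  // No matching rule means the URL is allowed
  return best === null || best.type === 'allow';
}

const rules: Rule[] = [
  { type: 'disallow', path: '/account/' },
  { type: 'allow', path: '/account/signup' },
];
console.log(isAllowed('/account/settings', rules)); // false (blocked)
console.log(isAllowed('/account/signup', rules));   // true (longer Allow wins)
```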

The Sitemap Directive

You can (and should) include your sitemap URL in robots.txt:

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

The Sitemap directive goes at the end, outside any user-agent block. It's not just a courtesy — Google actively uses this to find and prioritize crawling your important pages.

If you have multiple sitemaps (common for large sites with image or video sitemaps):

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-images.xml

You should also submit your sitemap directly in Google Search Console, but including it in robots.txt ensures any crawler that reads the file discovers it automatically.

Crawl-Delay: Use With Caution

Some crawlers support a Crawl-delay directive:

User-agent: *
Crawl-delay: 10

This tells bots to wait 10 seconds between requests. It was originally used to prevent crawlers from hammering small servers.

The problem: Googlebot ignores it. Google has said explicitly that they don't support Crawl-delay, and the old crawl-rate setting in Google Search Console has since been retired. If you genuinely need to slow Googlebot down, Google's guidance is to temporarily return 429 or 503 responses while you sort out server capacity.

Other bots like Bingbot do support Crawl-delay, so it's not completely useless — but for Google specifically, it's a no-op. Don't rely on it for rate limiting.

Common Mistakes That Hurt SEO

Blocking CSS and JavaScript files

This used to be common advice — "block bots from crawling scripts and stylesheets to save bandwidth." It's bad practice. Google needs to render your pages to properly evaluate them, and rendering requires CSS and JavaScript. If you block these:

# Don't do this
User-agent: *
Disallow: /wp-content/
Disallow: /assets/
Disallow: /*.css$
Disallow: /*.js$

Google can't see what your pages actually look like, which can hurt rankings. Allow crawlers to access your static assets.

Forgetting about staging and development environments

If your staging environment is accessible at staging.yourdomain.com or yourdomain.com/staging/, make sure it's either:

  1. Behind authentication (preferred)
  2. Blocked in robots.txt

Staging content indexed by Google creates duplicate content issues. This is a common oversight on sites that move fast.

Using robots.txt to protect sensitive content

This one is worth repeating. robots.txt is not access control. The file is publicly readable — anyone can visit yourdomain.com/robots.txt and see exactly which paths you've tried to hide. If those paths are interesting, people will visit them directly. Use server-side authentication for anything that's actually sensitive.

Blocking your own sitemap accidentally

# This accidentally blocks the sitemap
User-agent: *
Disallow: /
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml

In this example, /sitemap.xml is blocked by Disallow: / because it doesn't match the Allow: /public/ exception. The Sitemap directive in the file doesn't automatically unblock the sitemap URL. You'd need to add Allow: /sitemap.xml explicitly.
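A fixed version of that file adds an explicit exception for the sitemap path:

```
User-agent: *
Disallow: /
Allow: /public/
Allow: /sitemap.xml

Sitemap: https://yourdomain.com/sitemap.xml
```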

Wrong location

robots.txt must live at the domain root. yourdomain.com/robots.txt works. yourdomain.com/blog/robots.txt does nothing — crawlers won't look there.

Also, a robots.txt at yourdomain.com does not affect subdomain.yourdomain.com. Each subdomain needs its own robots.txt if you want to control crawling there.

robots.txt Examples for Common Site Types

Standard content or marketing site

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

If there's nothing to hide and no duplicate content problem, keep it simple.

WordPress site

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml

The Allow: /wp-admin/admin-ajax.php is important — some themes and plugins use this endpoint for front-end functionality that Googlebot needs to access for correct rendering.

E-commerce site with filter pages

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /products/*?sort=
Disallow: /products/*?page=

Sitemap: https://yourdomain.com/sitemap.xml

Cart and checkout pages should never be indexed. Account pages are user-specific content. Filter/sort URL combinations create duplicate content.

Next.js or SPA with API routes

User-agent: *
Disallow: /api/
Disallow: /_next/
Allow: /_next/static/
Allow: /_next/image

Sitemap: https://yourdomain.com/sitemap.xml

Block API routes. Block Next.js internal paths except for the static assets and image optimization endpoint, which renderers need.

Testing Your robots.txt

Don't assume your file is correct — verify it.

Google Search Console's robots.txt report (Settings > robots.txt) shows the robots.txt files Google has found for your site, when each was last fetched, and any parse errors. (The old standalone robots.txt Tester has been retired.)

The URL Inspection tool in Search Console lets you check individual URLs and see whether Google can access and render them, including whether a URL is blocked by robots.txt.

Third-party robots.txt checkers in various SEO suites can validate syntax and test multiple user agents. The SEO meta generator and OG image generator are useful companions when you're working on your site's overall SEO setup.

After making changes to robots.txt, you can request that Google re-fetch it using the URL Inspection tool — though changes typically propagate within a few days anyway.

Generating Your robots.txt

Writing robots.txt by hand isn't complicated, but it's easy to get the syntax slightly wrong — a missing slash, a directive in the wrong place, a user-agent block that doesn't cover what you thought.

The robots.txt generator lets you select which bots to configure and which paths to block, add your sitemap URL, and get clean, valid output. It's faster than looking up the syntax, and it reduces the chance of a typo that costs you crawl coverage on pages you care about.

Once generated, place the file at the root of your web server and verify it's accessible at yourdomain.com/robots.txt. For most frameworks:

  • Next.js: Add public/robots.txt or use the robots.ts route in the App Router
  • Gatsby: Place in static/robots.txt
  • Hugo: Place in static/robots.txt
  • Plain HTML sites: Place in the root directory served by your web server

For Next.js App Router users, you can also generate it programmatically:

// app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: {
      userAgent: '*',
      allow: '/',
      disallow: ['/admin/', '/api/'],
    },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

This approach is preferable for sites where the rules might depend on environment variables (e.g., blocking all crawlers on staging).
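As a sketch of that idea, here is a plain, hypothetical TypeScript helper; in a real app/robots.ts you would branch on something like process.env.VERCEL_ENV or NODE_ENV and return the MetadataRoute.Robots shape:

```typescript
// Hypothetical helper: choose robots rules from the deploy environment,
// so staging and preview deploys block all crawlers.
type RobotsConfig = {
  rules: { userAgent: string; allow?: string; disallow: string | string[] };
  sitemap?: string;
};

function buildRobots(env: string): RobotsConfig {
  if (env !== 'production') {
    // Non-production: block everything and advertise no sitemap
    return { rules: { userAgent: '*', disallow: '/' } };
  }
  return {
    rules: { userAgent: '*', allow: '/', disallow: ['/admin/', '/api/'] },
    sitemap: 'https://yourdomain.com/sitemap.xml',
  };
}

console.log(buildRobots('staging').rules.disallow); // '/'
```

Because the file is generated at build time, a staging deploy can never accidentally ship the production rules.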


robots.txt is simple, but the misconceptions around it — especially the crawling vs. indexing distinction — cause real, sometimes hard-to-debug SEO problems. Get the file right once, test it in Search Console, and then you can actually forget about it.
