AnalyticaHouse
Sep 4, 2022

What is Robots.txt and How to Create and Use It?

When search engine bots visit a website, they check its robots.txt file before crawling. Also known as the Robots Exclusion Standard, robots.txt tells crawlers which files, folders, or URLs on your web server they may or may not access.

You may have heard many misconceptions about how to use robots.txt. In reality, it simply tells visiting bots which URLs on your site they should crawl. It’s used primarily to reduce request load and optimize crawl budget. It is not a way to prevent pages from appearing in search results—that requires a <meta name="robots" content="noindex"> tag or authentication barrier.

What Is Robots.txt?

robots.txt is a plain-text file placed in your site’s root directory that gives crawlers directives about which URLs they may or may not crawl. The file itself must be reachable and return an HTTP 200 status for its directives to be read.

Bots generally obey these directives. Pages disallowed in robots.txt won’t be crawled, though if those URLs are linked from elsewhere, Google may still discover and index them without crawling their content.

SEO Tip: If bots encounter a 5xx server error reading your robots.txt, they’ll assume something is wrong and stop crawling. That can make images behind a CDN disappear from Google’s view, for example.

Why Is Robots.txt Important for SEO?

Before crawling your sitemap URLs, bots first fetch your robots.txt. Any incorrect directive can lead to important pages being skipped. A temporary misconfiguration shouldn’t be irreversible—but fix it quickly to avoid lasting harm.

For instance, if you accidentally disallow a key category page, it won’t be crawled until you remove the directive. Google caches robots.txt for up to 24 hours, so changes can take up to a day to take effect.
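
As a hedged illustration (the /category/ and /category/filter/ paths are hypothetical), an overly broad Disallow can take an entire category out of crawling, while a narrower rule limits the block to the URLs you actually meant to exclude:

# Before (too broad): blocks every URL under /category/
User-agent: *
Disallow: /category/

# After (the fix): blocks only the filtered listings you meant to exclude
User-agent: *
Disallow: /category/filter/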

Where to Find Robots.txt

Place your robots.txt in your site’s root directory (e.g. example.com/robots.txt). Crawlers universally look for it there—never move it.

Creating Robots.txt

You can hand-edit robots.txt with any text editor or generate it via an online tool. Then upload it to your site’s root.

Manual Creation

Open a plain-text editor and enter directives such as:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Save as robots.txt and upload to your root directory.

Recommended Directives

Key robots.txt commands (combined in a short example after this list):

  • User-agent: Selects which crawler a rule applies to.
  • Allow: Grants crawling permission.
  • Disallow: Blocks crawling of specified paths.
  • Sitemap: Points crawlers to your sitemap URL.
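
A minimal file combining all four directives might look like this (example.com and the /checkout/ path are placeholders, not recommendations for your site):

User-agent: *
Disallow: /checkout/
Allow: /
Sitemap: https://example.com/sitemap.xml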

User-agent

Specifies which crawler the rules that follow apply to. Common bots include:

  • Googlebot
  • Bingbot
  • YandexBot
  • DuckDuckBot
  • Baiduspider
  • …and many more.

Example: Block only Googlebot from a thank-you page:

User-agent: Googlebot
Disallow: /thank-you
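
Rules apply only to the user-agent group they sit under; every other crawler falls back to its own group or, if none matches, to the * group. A sketch, assuming you also want Bingbot kept out of the same hypothetical /thank-you page while everything else stays open to all bots:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /thank-you

User-agent: *
Allow: /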

Allow & Disallow

Allow: permits crawling. Without any directives, the default is “allow all.”

Disallow: forbids crawling of the specified path.

Examples:

  • Allow all:
    User-agent: *
    Allow: /
  • Block all:
    User-agent: *
    Disallow: /
  • Block a folder but allow one subpage:
    User-agent: *
    Disallow: /private/
    Allow: /private/public-info
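
Beyond exact path prefixes, Google’s crawlers also understand two pattern characters in Allow and Disallow rules: * matches any sequence of characters and $ anchors the end of a URL. A hedged sketch with hypothetical paths and parameter names:

User-agent: *
# Block internal search result pages (hypothetical ?q= parameter)
Disallow: /*?q=
# Block every PDF on the site
Disallow: /*.pdf$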
    

Testing with Google’s Robots.txt Tester

In Google Search Console, under Index > Coverage, you’ll see any robots.txt-related errors. You can also use the Robots.txt Tester to simulate how Googlebot handles specific URLs.

Common GSC Warnings

  • Blocked by robots.txt: URL is disallowed.
  • Indexed though blocked by robots.txt: The page is in the index despite being disallowed. Allow crawling again and add a noindex tag, or remove the internal links pointing to it (see the sketch after this list).
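
For the second warning, remember that Googlebot can only see a noindex tag on pages it is allowed to crawl, so the robots.txt side of the fix is removing the Disallow rule. A sketch with a hypothetical /old-campaign/ path:

User-agent: *
# Rule removed (left here as a comment) so Googlebot can recrawl the page and see its noindex tag:
# Disallow: /old-campaign/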

Best Practices & Reminders

  • Bots fetch robots.txt before crawling any page.
  • Use Disallow: to prevent low-value pages from being crawled and wasting budget.
  • Include your sitemap with Sitemap:.
  • Keep robots.txt under 500 KiB—Google only reads up to that size.
  • Test for server errors—5xx responses cause bots to stop crawling.
  • Respect case sensitivity in URL paths (see the sketch after this list).
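
Because matching is case sensitive, a rule written in the wrong case silently fails to block anything. A small sketch with a hypothetical /Private-Files/ folder:

User-agent: *
# Matches /Private-Files/ but NOT /private-files/
Disallow: /Private-Files/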

Conclusion

robots.txt is a simple yet critical file for guiding crawlers and optimizing your crawl budget. Ensure it’s correct, keep it at your root, and test any changes promptly.
