Build a robots.txt file for your website using a simple form. Add user-agent rules, block AI crawlers, set crawl delays, and include sitemaps, all without editing raw text.
A robots.txt file is a plain-text file placed at the root of your website that instructs web crawlers (also called bots or spiders) about which pages or sections they are allowed to access. It's part of the Robots Exclusion Protocol (REP), a widely respected convention followed by search engines and most well-behaved bots, and now standardized as RFC 9309.
A robots.txt file consists of one or more rule blocks. Each block starts with a User-agent: directive that specifies which bot the rules apply to. User-agent: * matches all bots. Below the user-agent, you list Allow: and Disallow: directives specifying URL paths. An empty Disallow: means allow everything; Disallow: / means block everything.
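The block structure described above looks like this in practice (the paths and the Bingbot block are illustrative, not recommendations):

```
# Rules for all crawlers
User-agent: *
Disallow: /private/
Allow: /private/faq.html

# Rules for Bingbot only; an empty Disallow allows everything
User-agent: Bingbot
Disallow:
```

A bot reads the block whose User-agent line best matches its own name and ignores the rest, so the `*` block acts as the fallback for bots not named elsewhere.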
A growing number of AI companies operate web crawlers to gather training data for large language models. These include GPTBot (OpenAI), CCBot (Common Crawl), Google-Extended (Google AI training), Amazonbot (Amazon), ClaudeBot (Anthropic), and others. You can block these specifically using their User-agent tokens without affecting Google's standard search crawler.
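A sketch of a file that opts out of the AI crawlers named above while leaving ordinary search crawling open (user-agent tokens as publicly documented by each vendor; check their current documentation, since tokens can change):

```
# Block common AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Everyone else, including Googlebot, may crawl normally
User-agent: *
Disallow:
```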
The Crawl-delay: directive tells bots to wait a specified number of seconds between requests. This can protect your server from being overwhelmed by aggressive crawlers. Note that Google does not respect Crawl-delay; use Google Search Console instead to control Googlebot's crawl rate.
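For bots that do honor the directive (Bing documents support for it, for example), the delay is declared inside that bot's rule block; the 10-second value here is just an illustration:

```
User-agent: Bingbot
# Wait at least 10 seconds between successive requests
Crawl-delay: 10
```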
The Sitemap: directive in robots.txt points crawlers directly to your XML sitemap. This helps search engines discover your content structure and index your pages more efficiently. You can include multiple Sitemap declarations, one per line.
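Sitemap lines take full absolute URLs and are independent of any User-agent block, so they can sit anywhere in the file (example.com is a placeholder domain):

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
```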
Robots.txt is not a security mechanism: it prevents crawling, not access. Sensitive pages must be protected by authentication or server-side access controls. Never use robots.txt to hide pages you don't want in Google's index; use a noindex meta tag instead. Disallowing a page in robots.txt can actually prevent Google from seeing the noindex tag, keeping the URL in the index without content.
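To keep a page out of search results, leave it crawlable in robots.txt and place this tag in the page's `<head>`, so crawlers can fetch the page and see the directive:

```
<meta name="robots" content="noindex">
```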