What is a Robots.txt File?
A robots.txt file is a simple text document that lives in the root directory of your website (e.g., yourdomain.com/robots.txt). It is typically one of the first things search engine crawlers like Googlebot request when they visit your site. This file gives them instructions on which pages or sections they are allowed to scan (crawl) and which they should ignore.
It is a fundamental part of the Robots Exclusion Protocol (REP), a standard that helps manage crawler traffic on your website. While it is not mandatory, having a well-structured robots.txt is a best practice for technical SEO.
Why You Need a Robots.txt File
- Prevent Crawling of Private Areas: You can block crawlers from accessing your admin login page (`/wp-admin`), internal search results pages, or thank-you pages.
- Manage Crawl Budget: Large websites have a "Crawl Budget." If Google wastes time crawling thousands of useless pages (like filtered search results), it might miss your important product or blog pages.
- Prevent Duplicate Content: You can block versions of pages that are meant for print or have specific tracking parameters in the URL.
- Specify Sitemap Location: You can tell all bots where to find your sitemap, helping them discover your content faster.
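The use cases above translate into just a few lines of text. Here is a minimal sketch (the blocked paths and the sitemap URL are placeholders, not recommendations for your specific site):

```text
# Block the admin area and internal search results for all bots
User-agent: *
Disallow: /wp-admin/
Disallow: /?s=

# Tell all bots where the sitemap lives (placeholder URL)
Sitemap: https://yourdomain.com/sitemap.xml
```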
The Key Directives Explained
A robots.txt file is made up of simple rules called "directives":
1. User-agent
This specifies which bot the rule applies to. You can target specific bots (e.g., `User-agent: Googlebot`) or use a wildcard for all bots (`User-agent: *`).
2. Disallow
This is the "Keep Out" sign. Any URL whose path starts with the value after `Disallow:` will not be crawled.
Example: `Disallow: /private/` blocks crawling of the entire `/private/` folder.
3. Allow
This directive creates an exception to a broader Disallow rule, re-permitting a specific path inside a blocked directory (major crawlers like Googlebot apply the more specific rule).
Example: If you disallow `/images/` but want to allow one specific image, you can write:
`Disallow: /images/`
`Allow: /images/logo.png`
4. Sitemap
This tells the crawler the full URL of your `sitemap.xml` file. It is a highly recommended best practice.
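Putting all four directives together, a complete file might look like this (the folder names and domain are illustrative):

```text
# Rules for every compliant crawler
User-agent: *
Disallow: /private/
Disallow: /images/
# Exception: one file inside a blocked folder
Allow: /images/logo.png

# An extra rule that applies only to Googlebot
User-agent: Googlebot
Disallow: /drafts/

# Full URL of the sitemap (placeholder domain)
Sitemap: https://yourdomain.com/sitemap.xml
```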
Important Warning: Robots.txt is Not a Security Tool
Blocking a folder in robots.txt does not make it private. Malicious bots and hackers completely ignore robots.txt. They will scan your `/private/` folder anyway. It is also important to note that even if a page is disallowed, Google may still index it (without crawling the content) if it finds links to it from other websites.
To truly secure a page, you must use password protection (`.htaccess`) or server-level authentication.
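On Apache servers, password protection via `.htaccess` looks roughly like this (a minimal sketch; the `AuthUserFile` path is a placeholder and must point to a real password file created with the `htpasswd` utility):

```text
# .htaccess inside the folder you want to protect
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /home/user/.htpasswd
Require valid-user
```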
How to Use This Generator
- Set the Default Policy: For most live sites, you want to "Allow All" by default. If your site is under development, you might "Disallow All."
- Add your Sitemap: Paste the full URL to your `sitemap.xml` file.
- Add Disallow Rules: Click "+ Add Folder" to specify any directories you want to hide (e.g., `/admin/`, `/cart/`, `/search.php`).
- Copy & Upload: Copy the generated code and save it as a file named `robots.txt` in the root (`public_html`) folder of your website.
Frequently Asked Questions (FAQ)
How can I test my robots.txt file?
Google Search Console includes a free robots.txt report (which replaced the older standalone robots.txt Tester) that shows which robots.txt files Google found and flags parsing errors. To test whether a specific URL is blocked for Googlebot, use the URL Inspection tool.
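You can also sanity-check your rules locally before uploading, using Python's standard-library `urllib.robotparser` (a quick sketch; the rules and URLs below are examples):

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as they would appear in the generated robots.txt
rules = """\
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the lines locally; no network request needed

print(parser.can_fetch("*", "https://example.com/admin/login"))  # False: blocked
print(parser.can_fetch("*", "https://example.com/blog/post"))    # True: allowed
```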
Does this block AI bots like GPTBot?
Yes, provided the bot honors robots.txt. The `User-agent: *` wildcard applies to all compliant bots, including those from Google, Bing, OpenAI (GPTBot), and Common Crawl (CCBot). If you wanted to block only a specific AI bot, you could add a targeted rule:
`User-agent: GPTBot`
`Disallow: /`
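If the goal is to block AI crawlers entirely while leaving search engines unaffected, a file could be sketched like this (bot names are accurate as of this writing, but new AI crawlers appear regularly, so check current user-agent lists):

```text
# Block OpenAI's and Common Crawl's bots everywhere
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```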