How to Use the robots.txt File to Exclude Files and Directories from Search Engines
The robots.txt file, also known as the robots exclusion file, is probably the best way to protect specific files and directories from being indexed by search engines. Why would you want to do this? To prevent certain Web pages, images or other Web objects from being added to a search engine’s database.
Every Web site should have a robots.txt file installed in the root directory. Why? Because it is the first file a legitimate search engine requests before it begins to index a site. A properly configured robots.txt file helps a search engine spider index a site more efficiently by telling the spider which directories and files it should ignore.
The robots exclusion file will only block legitimate spiders that recognize and obey the rules placed in the file. Compliance is completely voluntary, but all legitimate spiders do try to respect the file. This means the robots.txt file will do nothing to stop e-mail harvesters or other illegitimate spiders that may be searching for vulnerabilities or specific content. Even legitimate spiders sometimes ignore the robots.txt file, but it is nonetheless the most reliable way to prevent the indexing of images and documents you wish to protect.
If not used correctly, the robots.txt file can be a double-edged sword. Because it is a plain text file, it can be viewed by anyone on the Web. Simply enter the domain name followed by the name of the file, as in:
http://www.domainname.com/robots.txt
This means that if you use a hidden directory that you do not want people to know about, it would not be wise to place that directory in the robots.txt file.
The robots.txt file is a plain text file. Never use a word processor, such as Word, or an HTML editor that may save a file with formatting information or special codes. Use only a plain text editor, such as Microsoft’s Notepad. If you do not save the file as a plain text file, it may not be readable by search engine spiders. Also, the name of the file is robots.txt (all lower case). If you accidentally save it as Robots.txt (capitalized) or robot.txt (singular), it will not be read. The file must be saved in a site’s root directory.
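Since the file must sit in the root directory under the exact name robots.txt, its address can be derived from any page on the site. The following is a minimal Python sketch of that idea; the function name robots_url and the sample page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical page address, used only as an example.
print(robots_url("http://www.domainname.com/products/index.html"))
# prints http://www.domainname.com/robots.txt
```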
Here is the basic syntax.
To prevent all search engine spiders from searching a site, use the following:
User-agent: *
Disallow: /
The wildcard (*) applies the rule to all search engine spiders. The next line disallows indexing starting at the root directory. This is useful when you want to block an entire site from being indexed, such as a site that you frequently use to develop and test new Web sites. Be careful how you use it. If you apply this rule to an existing site you could wipe out all of the site’s search engine listings.
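One way to see what these two lines do before uploading them is to feed them to the robots.txt parser in Python's standard library (urllib.robotparser). This is only a sketch: the crawler name ExampleBot is made up, and real search engines use their own parsers, which may differ in details.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" is a hypothetical crawler name; the wildcard covers it.
home_allowed = parser.can_fetch("ExampleBot", "http://www.domainname.com/")
page_allowed = parser.can_fetch("ExampleBot", "http://www.domainname.com/about.html")
print(home_allowed, page_allowed)  # False False
```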
You can also block specific spiders from accessing any or all areas of a site using the spider’s User-Agent. The following blocks Google’s Googlebot spider from accessing a site.
User-agent: Googlebot
Disallow: /
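A quick check with Python's standard urllib.robotparser module suggests how a rule aimed at one spider leaves the others alone. ExampleBot is a made-up name standing in for any other crawler, and this is an illustration of the standard library's behavior rather than of Googlebot itself.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Googlebot", "Disallow: /"])

url = "http://www.domainname.com/index.html"
googlebot_allowed = parser.can_fetch("Googlebot", url)
# Hypothetical crawler that no rule in the file mentions.
other_allowed = parser.can_fetch("ExampleBot", url)
print(googlebot_allowed, other_allowed)  # False True
```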
This is the most common format for the robots.txt file:
User-agent: *
Disallow: /images/
These rules apply to all spiders, but tell them not to index anything in the /images/ directory. You can also tell the spiders to exclude a series of directories by omitting the trailing slash. A Disallow value is matched against the beginning of each path, so if you have several image directories, such as /images-people/, /images-mountains/, etc., you can exclude every directory whose name starts with "images" like this:
User-agent: *
Disallow: /images
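The effect of the trailing slash can be tested with Python's standard urllib.robotparser module, which matches a Disallow value as a path prefix. In this sketch the helper is_allowed, the crawler name ExampleBot and the file names are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(disallow_path, url_path):
    """Return True if a compliant crawler may fetch url_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    # "ExampleBot" and the host name are placeholders for illustration.
    return parser.can_fetch("ExampleBot", "http://www.domainname.com" + url_path)

# With the trailing slash, only the /images/ directory is blocked.
print(is_allowed("/images/", "/images/logo.gif"))           # False
print(is_allowed("/images/", "/images-people/anna.jpg"))    # True

# Without it, every path that begins with /images is blocked.
print(is_allowed("/images", "/images-people/anna.jpg"))     # False
print(is_allowed("/images", "/images-mountains/alps.jpg"))  # False
```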
Individual files can be excluded by specifying the path to the files:
User-agent: *
Disallow: /pdfs/accountsignup.pdf
To exclude all the files in the pdfs subdirectory use:
User-agent: *
Disallow: /pdfs/
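The difference between excluding one file and excluding the whole directory can be compared side by side with Python's standard urllib.robotparser module. This is a rough sketch: the helper blocked_paths, the crawler name ExampleBot and the file brochure.pdf are assumptions added for the example.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(disallow_path, url_paths):
    """Return the paths from url_paths that the rule keeps crawlers out of."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    site = "http://www.domainname.com"
    # "ExampleBot" is a hypothetical crawler name.
    return [p for p in url_paths if not parser.can_fetch("ExampleBot", site + p)]

# brochure.pdf is an invented second file in the same directory.
paths = ["/pdfs/accountsignup.pdf", "/pdfs/brochure.pdf"]

print(blocked_paths("/pdfs/accountsignup.pdf", paths))
# ['/pdfs/accountsignup.pdf']
print(blocked_paths("/pdfs/", paths))
# ['/pdfs/accountsignup.pdf', '/pdfs/brochure.pdf']
```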
To learn more about the robots.txt file and its features, visit robotstxt.org