Web indexing robots, also known as spiders, are used by many search engines such as Google, Inktomi, AltaVista and others. They are the tools these engines use to harvest data for their search indexes. When you submit your website to a search engine, you are effectively asking it to send its web indexing robot to your website so that the site can be crawled and added to its database.
So why do I need a robots.txt file?
Web indexing robots can be told which parts of your site they may index. You do this by placing a simple text file called robots.txt in the root path of the server, with explicit instructions on what the spider is and is not permitted to index on your website.
You can define which paths are off limits to spiders and block them off. This is useful for such things as large directories of information, personal information, and parts of the website containing large amounts of recursive links, among others.
It is also possible to include robot indexing instructions directly in a page's meta tags, and in some cases this is preferable if only one page needs to be controlled. You can use a meta tag like this <meta name="robots" content="INDEX,FOLLOW"> to tell the robot it is OK to index this page and follow the links it finds on this page. However, if you have whole directories and multiple pages whose indexing you want to control, then you need a robots.txt file to ease the burden of managing this task.
How accurate does my robots.txt file have to be?
The paths of the files or directories you list must be correct and must reflect the web-viewable path on the server.
Example: many servers use htdocs as the web root, but the FTP root will be different. Your robots.txt file should not include the htdocs directory in front of the file or directory name, because the htdocs folder itself is not viewable on the web. The files and directories inside htdocs are what need to be listed if you wish to control the spiders' indexing of them.
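As a rough illustration of the point above, the sketch below turns a path on disk into the path that belongs in robots.txt by stripping the web root. The function name web_path and the directory /home/example/htdocs are assumptions made for the illustration; your server's layout will differ.

```python
from pathlib import PurePosixPath


def web_path(disk_path: str, web_root: str) -> str:
    """Return the web-viewable path for a file that lives under web_root."""
    # relative_to raises ValueError if disk_path is not inside web_root.
    relative = PurePosixPath(disk_path).relative_to(PurePosixPath(web_root))
    return "/" + relative.as_posix()


# Hypothetical layout: the FTP login sees /home/example, the web server serves htdocs.
rule = "Disallow: " + web_path(
    "/home/example/htdocs/private/report.html",
    "/home/example/htdocs",
)
print(rule)
```

The printed rule contains no htdocs component, which is what a spider requesting pages over the web would expect.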
Do I have to have a robots.txt file in order to have search engines index my site?
The short answer is no! A web indexing robot will crawl your site unless told not to. However, let's go a little deeper than that. Good web indexing robots such as Googlebot or Slurp (Inktomi) are considered well-behaved web spiders and will attempt to find your robots.txt file before they index your site. Good robots will also look at your meta tags and check for the <meta name="robots" line in order to get instructions about what to index on that page. Now remember we said "good robots"... there are bad ones too! A bad spider may be as innocuous as a university project that has yet to include code for checking and obeying robots.txt, or it could be a malicious robot that harvests email addresses from websites for spam purposes. So what more can be done to stop these spiders?
The advanced way of stopping malicious spiders that ignore or disobey your robots.txt file is to block user agents at the server level, and even to go as far as blocking IP addresses where possible. A user agent is a signature attached to the robot's requests (provided its authors added one) which can be used to identify the robot. When a page is requested from your web server, software such as IIS (Windows Server) or Apache (Linux/Unix) will store this user agent information in your log files, which you can review and react to accordingly.
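Reviewing the logs can be partly automated. The sketch below is only an illustration: it assumes log lines in the Apache "combined" format, where the user agent is the last quoted field, and the sample lines, the robot name ExampleHarvester and the function name count_user_agents are all invented for the example.

```python
import re
from collections import Counter

# In the combined log format the user agent is the final double-quoted field.
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Tally the number of requests made by each user-agent string."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    '192.0.2.10 - - [01/Jan/2004:10:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '192.0.2.10 - - [01/Jan/2004:10:00:02 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '198.51.100.7 - - [01/Jan/2004:10:05:00 +0000] "GET /contact.html HTTP/1.0" 200 2048 "-" "ExampleHarvester/0.1"',
]

for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)
```

An agent that fetches pages without ever requesting /robots.txt, or that requests paths you have disallowed, is a reasonable candidate for blocking at the server level.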
Where does the robots.txt go?
Your robots.txt file is placed in the root directory. What does that mean? It means it should go in the same directory level as your home page (default.htm etc). You will know you got it right if you can type the following into your browser, http://www.robotstxt.ca/robots.txt, and see your robots.txt file come up; naturally, replace our URL with your URL. If you're still confused you can use the free testing wizard at www.sitesubmit.ca
Advanced Robots.txt Information from w3.org.
This comes from the larger HTML 4.01 specification.
B.4 Notes on helping search engines index your Web site
This section provides some simple suggestions that will make your documents more accessible to search engines.
Define the document language
In the global context of the Web it is important to know which human language a page was written in. This is discussed in the section on language information.
Specify language variants of this document
If you have prepared translations of this document into other languages, you should use the LINK element to reference these. This allows an indexing engine to offer users search results in the user's preferred language, regardless of how the query was written. For instance, the following links offer French and German alternatives to a search engine:
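<LINK rel="alternate" type="text/html" href="mydoc-fr.html" hreflang="fr" lang="fr" title="La vie souterraine">
<LINK rel="alternate" type="text/html" href="mydoc-de.html" hreflang="de" lang="de" title="Das Leben im Untergrund">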
Provide keywords and descriptions
Some indexing engines look for META elements that define a comma-separated list of keywords/phrases, or that give a short description. Search engines may present these keywords as the result of a search. The value of the name attribute sought by a search engine is not defined by this specification. Consider these examples,
<META name="description" content="Idyllic European vacations">
Indicate the beginning of a collection
Collections of word processing documents or presentations are frequently translated into collections of HTML documents. It is helpful for search results to reference the beginning of the collection in addition to the page hit by the search. You may help search engines by using the LINK element with rel="start" along with the title attribute, as in:
<LINK rel="start" type="text/html" href="page1.html" title="General Theory of Relativity">
Provide robots with indexing instructions
People may be surprised to find that their site has been indexed by an indexing robot and that the robot should not have been permitted to visit a sensitive part of the site. Many Web robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms: a "robots.txt" file and the META element in HTML documents, described below.
B.4.1 Search robots
The robots.txt file
When a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.
Here is a sample robots.txt file that prevents all robots from visiting the entire site:
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
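To see how a well-behaved robot might act on this file, the sketch below feeds the sample to Python's standard urllib.robotparser module and asks whether two pages may be fetched. The robot name ExampleBot is made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: *   # applies to all robots
Disallow: /     # disallow indexing of all pages
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# "ExampleBot" is a made-up robot name; the "*" record applies to it.
home_allowed = parser.can_fetch("ExampleBot", "http://www.foobar.com/")
page_allowed = parser.can_fetch("ExampleBot", "http://www.foobar.com/docs/page.html")
print(home_allowed, page_allowed)  # prints: False False
```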
The Robot will simply look for a "/robots.txt" URI on your site, where a site is defined as a HTTP server running on a particular host and port number. Here are some sample locations for robots.txt:
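The location can be worked out mechanically from any address on the site. This is a minimal sketch using urljoin from Python's standard library; the function name robots_url is hypothetical and the site addresses are chosen only for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    # A leading slash replaces the whole path but keeps scheme, host and port.
    return urljoin(page_url, "/robots.txt")


for site in (
    "http://www.w3.org/",
    "http://www.w3.org:80/",
    "http://www.w3.org:1234/",
    "http://w3.org/",
):
    print(site, "->", robots_url(site))
```

Each host and port combination counts as its own site, so each one gets its own robots.txt address.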
There can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.
Some tips: URIs are case-sensitive, and the "/robots.txt" string must be all lower-case. Blank lines are not permitted within a single record in the "robots.txt" file.
There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.
If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
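The way records are selected can be illustrated with Python's standard urllib.robotparser, used here as one example of a parser that matches names liberally. The robot names ExampleBot and OtherBot and the /private/ directory are invented for the sketch.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: examplebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The name matches the "examplebot" record despite the different case
# and the version suffix.
named_public = parser.can_fetch("ExampleBot/2.0", "/index.html")
named_private = parser.can_fetch("ExampleBot/2.0", "/private/a.html")
# No record names this robot, so the "*" record supplies the default policy.
other_public = parser.can_fetch("OtherBot/1.0", "/index.html")

print(named_public, named_private, other_public)  # prints: True False False
```

Note the blank line between the two records: it separates them, which is why blank lines must not appear inside a record.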
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
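The prefix rule is simple enough to check by hand. The following sketch, with a hypothetical helper named is_disallowed, applies a single Disallow value to a path using nothing more than a starts-with comparison.

```python
def is_disallowed(path: str, disallow_value: str) -> bool:
    """Apply one Disallow value to a URI path by prefix comparison."""
    # An empty value disallows nothing.
    return bool(disallow_value) and path.startswith(disallow_value)


for value in ("/help", "/help/"):
    for path in ("/help.html", "/help/index.html"):
        print(f"Disallow: {value:<7} {path:<17} blocked={is_disallowed(path, value)}")
```

Only the combination of "/help/" with "/help.html" comes out as not blocked, which matches the two examples above.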
An empty value for "Disallow" indicates that all URIs can be retrieved. At least one "Disallow" field must be present in the robots.txt file.
Robots and the META element
The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.
In the following example a robot should neither index this document, nor analyze it for links.
<META name="ROBOTS" content="NOINDEX, NOFOLLOW">
The list of terms in the content is ALL, INDEX, NOFOLLOW, NOINDEX.
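For illustration, here is roughly how a robot might read these terms from a page. This is a sketch built on Python's standard html.parser module, not a description of how any particular search engine does it; the class name RobotsMetaParser is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the terms from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names; values keep their case.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                term.strip().upper() for term in content.split(",") if term.strip()
            )


page = '<html><head><META name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
reader = RobotsMetaParser()
reader.feed(page)

may_index = "NOINDEX" not in reader.directives
may_follow = "NOFOLLOW" not in reader.directives
print(may_index, may_follow)  # prints: False False
```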
Note. In early 1997 only a few robots implement this, but this is expected to change as more public attention is given to controlling indexing robots.
To allow all robots/spiders to index your website
To exclude all robots/spiders from your website
To exclude all robots/spiders from part of your website
To exclude a single robot/spider
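The four cases listed above can be written as short robots.txt files. The sketch below holds one possible file for each case as a Python string and checks it with the standard urllib.robotparser module. The directory names /cgi-bin/ and /private/ and the robot names BadBot and GoodBot are placeholders; substitute your own.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = """\
User-agent: *
Disallow:
"""

EXCLUDE_ALL = """\
User-agent: *
Disallow: /
"""

EXCLUDE_PART = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

EXCLUDE_ONE_ROBOT = """\
User-agent: BadBot
Disallow: /
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(ALLOW_ALL, "GoodBot", "/index.html"))         # True
print(allowed(EXCLUDE_ALL, "GoodBot", "/index.html"))       # False
print(allowed(EXCLUDE_PART, "GoodBot", "/index.html"))      # True
print(allowed(EXCLUDE_PART, "GoodBot", "/private/a.html"))  # False
print(allowed(EXCLUDE_ONE_ROBOT, "BadBot", "/index.html"))  # False
print(allowed(EXCLUDE_ONE_ROBOT, "GoodBot", "/index.html")) # True
```

Each string is the complete content of a robots.txt file for that case; only one of them would be installed at a time.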