The standard for robots.txt was finalised on the robots mailing list on 30 June 1994, and is not backed by any standards body. It has nevertheless become common practice for well-behaved robots to follow it.

Why that filename?

The filename was chosen because it fits the naming restrictions of all common operating systems, requires no extra configuration on a web server, clearly indicates its purpose, and is unlikely to clash with any existing files.

File Format

The file consists of one or more records separated by one or more blank lines (lines are terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<value>". The field name is case insensitive. Comments can be embedded using the Bourne shell convention of a # character.
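The record format described above can be sketched as a few lines of Python. This is an illustration of the rules (blank-line-separated records, case-insensitive field names, # comments), not a full robots.txt parser:

```python
# Sketch of the record format: blank lines separate records, each line is
# "<field>:<value>", field names are case insensitive, "#" starts a comment.
def parse_records(text):
    records, current = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip Bourne-shell style comments
        if not line:                          # a blank line ends the current record
            if current:
                records.append(current)
                current = []
            continue
        field, _, value = line.partition(":")
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records

example = "User-agent: *  # all robots\nDisallow: /tmp\n"
print(parse_records(example))
# → [[('user-agent', '*'), ('disallow', '/tmp')]]
```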

Allowed fields

User-agent
This indicates the name of the robot to which the record applies. If more than one User-agent line is present, the record applies to all of them. It is recommended that robots be case insensitive when checking this field.
If you wish to set a default rule, or access policy, use the value "*".

Disallow
This field specifies a partial URL that a robot may not visit. This can be a full path or a partial path prefix.

Unfortunately there is no way of specifying particular directories for a robot to spider; the original standard defines no Allow field.
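Because Disallow values are matched as path prefixes, "Disallow: /help" blocks both /help.html and /help/index.html, while "Disallow: /help/" would block only the directory contents. Python's standard-library urllib.robotparser implements this convention; a quick check (example.com is a placeholder host):

```python
import urllib.robotparser

# "Disallow: /help" is a prefix match: it blocks /help.html as well as /help/...
rp = urllib.robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow: /help".splitlines())

print(rp.can_fetch("anybot", "http://example.com/help.html"))        # False
print(rp.can_fetch("anybot", "http://example.com/help/index.html"))  # False
print(rp.can_fetch("anybot", "http://example.com/about.html"))       # True
```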

Examples

To tell all robots not to crawl your site:

# robots.txt for no crawling
User-agent: *
Disallow: /

To stop any robot from hitting /cgi-bin and /tmp:

# Stop looking where you're not supposed to
User-agent: *
Disallow: /cgi-bin
Disallow: /tmp

To allow a particular robot to wander, but stop every other robot:

# Stop looking where you're not supposed to
User-agent: *
Disallow: /

User-agent: thefez
Disallow:
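This policy can be sanity-checked with urllib.robotparser. Here "thefez" is the made-up robot name from the example and example.com is a placeholder host; note the blank line that separates the two records:

```python
import urllib.robotparser

# Two records: a default "*" record that blocks everything, and a record
# for "thefez" whose empty Disallow permits the whole site.
policy = """\
# Stop looking where you're not supposed to
User-agent: *
Disallow: /

User-agent: thefez
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(policy.splitlines())

print(rp.can_fetch("thefez", "http://example.com/anything"))    # True
print(rp.can_fetch("otherbot", "http://example.com/anything"))  # False
```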
