A web spider, or web robot, is a computer program (or set of programs) that systematically visits pages on the World Wide Web and gathers information about them.

The web spider looks at a page, parses it, and gathers any links to other URLs that appear on the page. It then adds the new URLs to the list of pages it will crawl, or look at, in the future.
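A minimal sketch of that loop, using only the Python standard library, might look like the following. The function and variable names (crawl, frontier, seen, max_pages) are illustrative choices, not part of any particular spider.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkParser(HTMLParser):
        """Collects the href values of all <a> tags on a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seed_url, max_pages=10):
        """Breadth-first crawl starting from seed_url."""
        frontier = deque([seed_url])   # URLs still to visit
        seen = {seed_url}              # URLs already queued, to avoid loops

        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that fail to load

            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)      # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)      # queue it for a future visit

            print("crawled:", url)

Using a queue gives a breadth-first crawl; swapping the deque for a stack would make it depth-first instead.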

This is how search engines gather information about web pages. For example, you might submit the URL of your personal home page to a search engine. The search engine's web spider crawls your page, reading in its HTML text and putting information about it into a database.

If your page has a link to a different page, the web spider stores the URL for that link in its database, and crawls the new page. The process keeps going for as long as the spider programs keep running.
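How a spider might record what it finds is up to the implementation; a small sketch using SQLite could look like this. The database file name and the one-table schema (URL plus raw HTML) are assumptions for illustration only.

    import sqlite3

    # Hypothetical schema: one row per crawled page, keyed by URL.
    conn = sqlite3.connect("spider.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url   TEXT PRIMARY KEY,
               html  TEXT
           )"""
    )


    def store_page(url, html):
        """Insert or update the stored copy of a crawled page."""
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
            (url, html),
        )
        conn.commit()

A real search engine would store much more than the raw HTML (extracted text, titles, link structure), but the idea is the same: every crawled URL ends up as a record the engine can query later.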

Web spiders sometimes hit a web server too hard, requesting many of the pages on that server in rapid succession. Too many hits in a short time can slow the server down or even crash it. A well-behaved spider won't do this.
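One common way to behave is to wait a little while between requests to the same host. The sketch below shows the idea; the five-second delay and the function name wait_politely are assumptions, not a standard.

    import time
    from urllib.parse import urlparse

    CRAWL_DELAY = 5.0   # assumed minimum gap, in seconds, between hits to one host
    last_hit = {}       # host -> timestamp of the most recent request


    def wait_politely(url):
        """Sleep if the same host was requested less than CRAWL_DELAY seconds ago."""
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0.0)
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.time()

Calling wait_politely(url) before each fetch in the crawl loop keeps the spider from hammering any single server.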

To stop this from happening, the idea of a robots.txt file was developed. A polite spider checks the robots.txt file on a web server before it crawls the site, to see which folders or files it should avoid crawling. Some spiders aren't polite, so you can't count on a robots.txt file to ensure your privacy.
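Python ships a parser for these files, so a polite spider's check can be quite short. This is a sketch under the assumption that the spider identifies itself with the made-up user agent "MySpider" and treats an unreachable robots.txt as permission to crawl.

    from urllib.parse import urljoin, urlparse
    from urllib.robotparser import RobotFileParser


    def allowed_to_crawl(url, user_agent="MySpider"):
        """Consult the site's robots.txt before fetching url."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        robots = RobotFileParser()
        robots.set_url(urljoin(root, "/robots.txt"))
        try:
            robots.read()          # fetch and parse robots.txt
        except OSError:
            return True            # assumption: no robots.txt reachable means allowed
        return robots.can_fetch(user_agent, url)

The crawl loop would call allowed_to_crawl(url) before fetching each page and simply skip any URL the file disallows.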
