A web spider, or web robot, is a computer program (or set of programs) that systematically visits pages on the World Wide Web and gathers information about them.

The web spider will look at a page, parse it, and gather any links to other URLs that appear on the page. It then adds the new URLs to its list of pages that it will crawl, or look at, in the future.
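That crawl-parse-queue loop can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `fetch` function is a stand-in you supply (so the example can run against canned pages instead of the live web), and real crawlers add politeness delays, error handling, and robots.txt checks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, extract its links, and queue
    any URL not seen before. `fetch(url)` must return the page's HTML."""
    seen = {start_url}
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Feeding it a tiny two-page "site" (a dict standing in for the network) shows both pages get discovered, since each links to the other.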

This is how search engines gather information about web pages. For example, you might submit the URL of your personal home page to a search engine. The search engine's web spider crawls your page, reading in its HTML text and putting information about it into a database.

If your page has a link to a different page, the web spider stores the URL for that link in its database, and crawls the new page. The process keeps going for as long as the spider programs keep running.

Web spiders sometimes hit a web server too hard, meaning that the spider makes a large number of requests to that server in a short period of time. Too many hits can slow the server down or even crash it. A well-behaved spider won't do this.

To stop this from happening, the idea of a robots.txt file was developed. A nice spider checks the robots.txt file on a web server before it crawls the site, to see which folders or files it should avoid crawling. Some spiders aren't nice, though, so you can't count on a robots.txt file to ensure your privacy.
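Python's standard library ships a robots.txt parser, so a nice spider needs only a few lines to honor the file. Here the rules are supplied inline for illustration (a hypothetical file disallowing a /private/ folder); a live crawler would instead call `set_url()` and `read()` to fetch the site's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all agents must avoid /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite spider asks before crawling each URL.
print(rp.can_fetch("MySpider", "http://example.com/index.html"))         # True
print(rp.can_fetch("MySpider", "http://example.com/private/diary.html")) # False
```

Note that nothing enforces this check: it only works because the spider volunteers to ask, which is exactly why robots.txt can't protect your privacy from a spider that doesn't.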
