YOU CHECK YOUR Web-server log one morning and find unusually heavy traffic beginning at your home page and moving through its linked pages in a methodical pattern. Each page has been fetched at regularly spaced intervals. A scan of the logged request headers reveals the word "scooter."
You've been crawled.
It sounds creepy, but getting crawled is good. It means a search engine's spider was checking out your site to update its catalog of Web documents. Search engines use autonomous agents that fetch one URL, collect its contents, then follow that page's links to find others. These agents, also known as spiders, bots, crawlers or wanderers, count, parse and index Web documents of a certain size or type. The best leave small calling cards, like AltaVista's "scooter."
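In outline, the crawl loop is simple: fetch a page, store its contents, pull out its links and queue any new ones for a later visit. The sketch below is a hypothetical Python illustration, not any engine's actual code; the agent name "examplebot" and the page limit are made up, and the User-Agent string is the "calling card" that shows up in your log:

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10, user_agent="examplebot/0.1"):
    """Breadth-first crawl: fetch a page, index it, then follow its links."""
    queue, seen, index = [start_url], {start_url}, {}
    while queue and len(index) < max_pages:
        url = queue.pop(0)
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        try:
            with urllib.request.urlopen(request) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        index[url] = html                 # collect the page's contents
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:      # use the page's links to find others
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index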
But intelligent agents do more than search Web sites for documents they can index. Checkbot, a maintenance utility from the Netherlands, roams sites and reports broken links. Macromedia's GetBot searches sites for examples of Shockwave movies it can point to from its own site.
You may not want spiders crawling through every page on your site. The content of extremely dynamic pages changes so rapidly that the information may be obsolete before the engine even stores it in its index. You may also want to reserve bandwidth for legitimate visitors, because a poorly designed spider, downloading page after page without pause, can overload a server.
Well-behaved agents pause between page fetches. They can also measure server performance and visit especially slow servers less often, as Inktomi's Slurp does. WebCrawler spiders conserve bandwidth by hitting a site's first level, or home page, then returning later for in-depth crawls.
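The pacing itself takes only a few lines. Here is a minimal sketch in Python, assuming made-up delay constants rather than any vendor's real settings: it times each fetch and waits longer before the next one when the server responds slowly:

import time
import urllib.request

BASE_DELAY = 2.0   # assumed minimum pause, in seconds, between fetches
SLOW_FACTOR = 5.0  # assumed multiplier: a slow response stretches the pause

def polite_fetch(urls, user_agent="examplebot/0.1"):
    """Fetch each URL in turn, pausing longer when the server is slow."""
    pages = {}
    for url in urls:
        start = time.monotonic()
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response:
            pages[url] = response.read()
        elapsed = time.monotonic() - start
        # A sluggish response suggests a busy server; back off proportionally.
        time.sleep(max(BASE_DELAY, elapsed * SLOW_FACTOR))
    return pages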
"Polite" spiders don't violate bounds placed on them by the Standard for Robot Exclusion (SRE). Under SRE rules, a Webmaster can place a file listing pages to avoid on-site. Compliant spiders check this file, ROBOT.TXT, and exclude its contents from their crawls. A simple ROBOT.TXT file might contain the following lines:
User-agent: *
Disallow: /MyDocs/Personnel

User-agent: Snoopy
Disallow: /
This file prohibits an agent named Snoopy from accessing any file on this Web server. All others, however, are welcome to crawl every page except those under /MyDocs/Personnel. But remember, more than just robots can read this file.
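A compliant crawler consults the file before it fetches anything else. Python's standard library includes a parser for the format, so a minimal check against the example above (the agent name "WebCrawler" and the host www.example.com are placeholders) might look like this:

from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /MyDocs/Personnel

User-agent: Snoopy
Disallow: /
"""

# Normally the crawler would call set_url() and read() against the live site;
# parsing the text directly keeps the sketch self-contained.
rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

print(rules.can_fetch("Snoopy", "http://www.example.com/index.html"))           # False
print(rules.can_fetch("WebCrawler", "http://www.example.com/MyDocs/Personnel")) # False
print(rules.can_fetch("WebCrawler", "http://www.example.com/index.html"))       # True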
Lycos and AltaVista have reported unauthorized requests to re-register a site with a false address or even delete a site's URLs from an engine's index, presumably to sabotage rival sites. Webmasters, intent on luring visitors to their sites, can find ingenious ways to "spamdex," or fool search engines into raising a site's relevancy score. Spamdexers add multiple occurrences of popular keywords to a Web page in invisible (small or background-colored) type. They "load" meta tags with copies of the same keyword or hide keywords behind images. The keywords ensure frequent appearances and high relevancy scores in result sets even if they bear no relation to the page's contents.
Spamdexing frustrates searchers and can render the relevancy rankings of keyword search engines useless. Search-engine developers are fighting back by changing the way they score sites, penalizing or dropping a site if a keyword appears too many times in a meta tag.
These tactics may help screen out cheaters, but they can also hurt sites that legitimately repeat keywords.
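The kind of scoring change described above can be sketched in a few lines. This is purely illustrative; the repeat threshold and the penalty factor are assumptions, not any engine's published rules. The function counts how often each term is repeated in a page's meta keywords tag and discounts the page's relevancy score when repetition passes the threshold:

import re
from collections import Counter

REPEAT_LIMIT = 3   # assumed threshold: more repeats than this looks like keyword loading
PENALTY = 0.5      # assumed discount applied to a suspect page's relevancy score

def meta_keyword_penalty(html, base_score):
    """Discount a page's score if its meta keywords tag repeats terms heavily."""
    match = re.search(r'<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)["\']',
                      html, re.IGNORECASE)
    if not match:
        return base_score
    keywords = [k.strip().lower() for k in match.group(1).split(",") if k.strip()]
    counts = Counter(keywords)
    if counts and max(counts.values()) > REPEAT_LIMIT:
        # Penalize suspected spamdexing; a page that legitimately repeats
        # a keyword can trip this test, too.
        return base_score * PENALTY
    return base_score

page = '<html><head><meta name="keywords" content="cars, cars, cars, cars, cars"></head></html>'
print(meta_keyword_penalty(page, 1.0))   # prints 0.5, because "cars" appears five times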