WinMag * March 1997 * Cover Story

Notes From The Lab
Site Retrievers

-- by Lenny Bailes

During our testing of Web page-retrieval packages, we learned exactly how these programs work. We also discovered that the methods used to store Web information on your hard disk have changed since the previous generation of products in this category. Our testing revealed that these changes yield both benefits and limitations.

For example, ForeFront's original WebWhacker 1.0 simply grabbed files from a Web server piece by piece, mirroring the server's Web directory structure on your local hard disk. Where necessary, it rewrote the relative and absolute file references in the downloaded HTML code so that they pointed to the local copies. This approach was workable in the pre-Java days of simple text and graphics, but Web sites have grown considerably more complex, with large multimedia files and references to CGI scripts and Java applets stored in protected areas on the server.
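The mirroring approach described above can be illustrated with a minimal Python sketch. It is not WebWhacker's actual code: the function names, the `index.html` default document and the host-named top-level directory are all assumptions made for illustration. It shows how a link in a downloaded page might be rewritten to point at the local copy.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def local_path(url):
    """Map a URL to a path inside the mirror, such as host/dir/file.html."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # assumed default document name
    return posixpath.join(parts.netloc, path.lstrip("/"))


def rewrite_link(page_url, href):
    """Rewrite href as a relative local path when it targets the same host."""
    target = urljoin(page_url, href)  # resolve relative and absolute forms
    if urlparse(target).netloc != urlparse(page_url).netloc:
        return href  # off-site links are left pointing at the live Web
    start = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target), start)


page = "http://www.example.com/docs/guide/intro.html"
print(rewrite_link(page, "/images/logo.gif"))
print(rewrite_link(page, "chapter2.html"))
```

A server-absolute reference such as `/images/logo.gif` becomes a path relative to the saved page, so the mirrored copy works from any directory on the hard disk. A scheme like this has nothing to copy when a link leads to a CGI script or a protected area of the server.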

To handle the increased physical size of Web sites, page-retrieval programs now pack most of the material they download into proprietary compressed archives. The programs use this process to simplify the tracking of internal directory references, thus making it possible to perform comprehensive searches on an entire archived Web site. This technique, however, makes it difficult to work with a downloaded site's individual components. For example, you can't transfer the downloaded site to another computer, edit its HTML code, or save or print individual graphics. Only one program we reviewed--Web Buddy--lets you export a saved site into browser-linked components on your hard disk. The vendors of the other site retrievers we tested indicated plans to implement export functions in future versions.
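To make the trade-off concrete, here is a rough Python sketch of the single-archive idea. The products we tested use proprietary formats that we did not examine, so an ordinary ZIP archive stands in here, and the function names are hypothetical.

```python
import io
import zipfile


def pack_site(pages):
    """Store {relative_path: html_text} in one compressed archive (bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, html in pages.items():
            archive.writestr(path, html)
    return buf.getvalue()


def search_site(archive_bytes, term):
    """Return the sorted paths of archived pages whose text contains term."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("utf-8", errors="replace")
            if term.lower() in text.lower():
                hits.append(name)
    return sorted(hits)


site = pack_site({
    "index.html": "<h1>Welcome</h1>",
    "news/today.html": "<p>Java applets arrive</p>",
})
print(search_site(site, "java"))
```

Because every page lives in one container, a whole-site search is a single loop over the archive's members. The same design explains the limitation: an individual page or graphic is not an ordinary file on disk until the program offers a way to export it.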

All the vendors told us their programs should support downloading and offline browsing of embedded multimedia, Java and Microsoft ActiveX objects. In our testing, however, retrieval and playback of these objects was erratic. We also discovered a problem related to processing CGI scripts. Interactive Web sites that accept user input and update themselves in real time are a new challenge for Web page retrievers. During our tests, we found that we could not monitor and download messages at conferencing sites such as Howard Rheingold's Electric Minds (http://www.minds.com). In fact, we had to disable two of the programs before input from this site would appear in our Web browser. One program developer said the problem is due to the retrieval programs' inability to follow internal jump references placed by the site in the browser's cookies file.

Review: WebWhacker 2.0