Features – Book Review: The Invisible Web

Donna F. Cavallini is Manager of Competitive Knowledge with the law firm of Kilpatrick Stockton. She has a J.D. from St. Louis University School of Law, and was Library Program Administrator for the Florida Attorney General’s Office before joining the firm.

Updated October 1, 2002

The paradox of the Invisible Web is that while it hosts some of the most useful online data, unless a searcher knows exactly where the desired information resides, even the most sophisticated search engine or meticulously crafted query won’t be able to find it. The reason? The Invisible Web is the collective name for online sources of information that search engine indexing spiders do not reach, whether because a webmaster has blocked them by administrative fiat, because the data is formatted in a way the spider cannot parse, or because the search engines simply decide not to include them for various business reasons. A new book by Gary Price and Chris Sherman, The Invisible Web: Uncovering Information Sources Search Engines Can’t See (CyberAge/Information Today), aims to empower searchers to surmount these obstacles, in part by explaining the technical reasons why search engines otherwise inexplicably fail to return relevant results, and in part by providing a directory of selected subject-specific tools for accessing this valuable hidden web content.
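For readers who want to see what blocking by a webmaster can look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the block is expressed through a robots.txt exclusion file, which is one common mechanism and is not named in the review itself. The rules, the `example.com` URLs, and the `ExampleSpider` agent name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules a webmaster might publish (assumption:
# the review does not say how the blocking is implemented).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, agent="ExampleSpider"):
    """Return True if a well-behaved spider is allowed to fetch this URL."""
    return parser.can_fetch(agent, url)


print(may_index("http://example.com/index.html"))
print(may_index("http://example.com/private/report.html"))
```

A spider that honors these rules never fetches anything under `/private/`, so those pages cannot turn up in its search results no matter how the query is worded.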

The authors are two of the information industry’s most prominent representatives, and their backgrounds, experience and reputations lend serious credibility to the work. Chris Sherman is associate editor of SearchEngineWatch.com, and his articles regularly appear in leading information industry periodicals such as Information Today and Online. Gary Price, a former reference librarian at George Washington University, is now in the library and information consulting business, and is well known for Gary Price’s List of Lists and his weblog, Virtual Acquisition Shelf & News Desk. These credentials have made Sherman and Price experts at understanding the technological infrastructure of the web and practiced in the ways in which its limitations may be overcome.

The Problem of Hidden Content

In the first eight chapters of the book, the authors deconstruct the problem of hidden content, attributing its origins to the web’s increasing complexity, lack of governance and standards, and precipitate growth. They then delineate the various types of invisible web content and explain why the content is invisible to search engines:

  • Disconnectedness – Indexing spiders only index pages that are linked from other known pages on the web or are directly submitted to search engines by webmasters. Newly posted pages that meet neither condition suffer from disconnectedness and therefore won’t appear in search engine query results.
  • File Format Issues – Special file formats such as Postscript, Flash, Shockwave, executables, and compressed file formats (.zip, .tar, etc.), while technically indexable, are ignored by most search engines for policy or economic reasons (AltaVista, for instance, can handle as many as 250 different file formats, but its free public site in fact only indexes a few).
  • Pages Consisting Primarily of Images, Audio or Video – Pages with little or no text give indexing spiders insufficient information to determine what the page is about.
  • Relational Databases – Spiders are unable to fill in the blanks in interactive forms which serve as a gateway to data in relational databases.
  • Dynamically Generated Content – Spamming tactics such as “spider traps” cause search engines to categorically avoid dynamically generated content on policy grounds.
  • Real Time Content – Content which is ephemeral or rapidly changing by its very nature (e.g., flight tracking data) is of limited historical value, and indexing it would in time burden servers with enormous storage requirements.
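Several of these categories can be illustrated with a toy spider. The sketch below is not taken from the book and does not describe how any real search engine works. It crawls a small, hypothetical in-memory "web" and makes two simplifying assumptions: any URL with a query string is treated as dynamically generated, and a short, made-up list of file extensions stands in for unsupported formats.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical miniature "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": [
        "http://example.com/about.html",
        "http://example.com/report.ps",
        "http://example.com/search?q=flights",
    ],
    "http://example.com/about.html": ["http://example.com/"],
    "http://example.com/report.ps": [],
    "http://example.com/search?q=flights": [],
    "http://example.com/new-page.html": [],  # nothing links here
}

# Illustrative only: formats this toy spider chooses not to index.
SKIPPED_EXTENSIONS = (".ps", ".swf", ".exe", ".zip", ".tar")


def skip_reason(url):
    """Return why the spider refuses a URL, or None if it will index it."""
    parts = urlparse(url)
    if parts.query:
        return "dynamically generated"
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return "file format"
    return None


def crawl(seed, web):
    """Breadth-first crawl that follows links outward from a known seed page."""
    indexed, skipped = [], {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        reason = skip_reason(url)
        if reason:
            skipped[url] = reason
            continue
        indexed.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed, skipped


indexed, skipped = crawl("http://example.com/", TOY_WEB)
# Pages that exist but were never discovered: the "disconnected" ones.
never_seen = set(TOY_WEB) - set(indexed) - set(skipped)

print("indexed:", indexed)
print("skipped:", skipped)
print("never seen:", never_seen)
```

Here the spider indexes only the home page and the "about" page. It deliberately passes over the Postscript file and the query-string URL, and it never learns that `new-page.html` exists because no known page links to it.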

Subject Specific Practical Solutions for Finding the Information You Need

Having given the searcher the theoretical framework necessary for understanding the nature of the problem of invisible content, the authors then devote the book’s remaining nineteen chapters to subject-specific practical solutions for getting to the data required. In chapters ranging from art and architecture to legal and criminal resources to transportation data, the authors present selected tools for mining the invisible web, and include a chapter on the “Best of the Invisible Web,” noting several general pathfinders which, though they include links to searchable web resources, are useful starting points for invisible web research. Thoughtfully, the authors also have created a free companion website which not only provides links to all resources listed in the book’s directory, obviating the problem of stale published content, but also allows the inclusion of additional resources which were excluded because of the book’s printed page constraints or which were unavailable at the time of publication.

While the book is clearly well researched and documented, there is some evidence of production and factual errors. For instance, although West Publishing Co. acquired Findlaw and moved its West’s Legal Directory lawyer locator to a subpage of the Findlaw site early this year, the book’s cited URL is still http://www.lawoffice.com instead of http://directory.findlaw.com (although entering the former URL will redirect to the correct site). The book also notes that CompanySleuth corporate information resources are available for free to those who have registered, when in fact the site now merely provides access to news articles through an apparent affiliation with Electric Library. In addition, although the book only became available in mid-October rather than in August as planned – a delay attributed to printing errors – the book’s foreword, written by SearchEngineWatch editor Danny Sullivan, duplicates three of its six paragraphs.

These technicalities aside, one glaringly obvious and inexplicable omission from the book’s coverage is Lexibot, a metasearch engine utility which simultaneously queries search engines and databases and which was specifically designed and marketed as an invisible web search tool. The authors do critically discuss a study conducted by Bright Planet, Lexibot’s developer, disagreeing with the company’s estimates of the size of the invisible web because Bright Planet includes within its definition of the invisible web certain data sources which are technically searchable (ephemeral data, specialized databases). One might be inclined to dismiss the omission on the grounds that the authors intended only to cover pure invisible web resources, but the authors include among the pathfinders in the directory portion of the book resources which they admit are hybrid tools – part searchable web, part invisible web. One might also be inclined to dismiss the omission on the grounds that the authors intended only to cover free resources (beyond a 30-day free trial period, Lexibot is available only to registered purchasers), but there are resources included in the book which are fee-based at least in part, such as Electric Library (searching is free, articles are accessible to paid subscribers). This makes one wonder what other resources and issues the authors failed to address.

Still, the book is a must-read. While the resources selected for the various subject categories in the book’s directory seem to be all the usual suspects – any researcher worth his salt will be familiar with the basic tools of his practice specialty (e.g., for intellectual property researchers, the invisible web Patent and Trademark databases of the USPTO) – a researcher confronted with a research request outside the scope of his normal expertise will find the subject directory a great tool for homing in on quality data amid the web’s dross. What makes the book truly invaluable, however, is that it imparts a technical understanding of why search engines don’t always return the data we expect; armed with this knowledge, researchers will be empowered to improve the effectiveness of their online research, and ultimately to better serve their patrons and client constituencies.
