ResearchWire – Searching on the Web: Adventure or Abyss?

Genie Tyburski is the Research Librarian for Ballard Spahr Andrews & Ingersoll in Philadelphia, Pennsylvania, and the editor of The Virtual Chase™: A Research Site for Legal Professionals.

(Archived June 1, 1998)


Without question, search services provide the most popular, if not always the best, method for finding information on the Web. Yet, time and again, I hear complaints about the quantity and quality of information retrieved when using them. In a recent telephone conversation, for example, one of my colleagues growled about a hit list in excess of 40,000 links resulting from an AltaVista search for information about an aspect of Americans with Disabilities Act compliance. After methodically connecting to links from the first three screens of hits and finding nothing relevant to the query, she threw up her hands in exasperation and quit the Web.

It’s nothing but hype after all, isn’t it?

Hmm. While I believe many exaggerate the power of the Web, in this instance the query failed because the researcher lacked an understanding of the capabilities of search engines and, perhaps, of the Web itself.

The Web is not a database and therefore is not comparable to familiar online legal research systems like Lexis or Westlaw. Rather, it is a hypermedia system that integrates multiple data types (text, graphics, video and sound).

To understand it, envision a giant spider’s web. Then imagine traversing this web by starting at any point and moving in any direction. Simply by following the web’s joints, one advances or cedes to the cunning of a spider.

Of course, no hungry spider awaits those lost in cyberspace. Yet knowing the sundry ways in which search services collect and index data, and how these methods affect the retrieval of information, may help researchers avoid frustration like that experienced by my colleague and friend.

First, recognize that search services fall into four broad categories: catalogs or indexes, databases, engines, and meta searchers. Catalogs, interchangeably called indexes or directories, logically arrange pointers to information. Typically organized by subject, they include services like Yahoo!, FindLaw and CataLaw. Humans, sometimes librarians, maintain catalog sites, assigning resources to appropriate categories.

Databases encompass many familiar research services like Lexis-Nexis Research on the Web, westlaw.com, DialogWeb, Dun & Bradstreet, and Dow Jones Publication Library. The category also includes countless newer resources like the CDC Prevention Guidelines Database, V., LOIS, and the U.S. Patent Bibliographic Database. Databases consist of collections of special material not necessarily available elsewhere on the Web.

Engines, the popular term for single search servers, index key fields or terms within a document, or the entire document. Services like AltaVista, HotBot, Infoseek, Excite, Lycos and WebCrawler fall into this category.

Meta searchers, like Inference Find, MetaCrawler, Savvy Search, and Cyber411, offer researchers a way to search more than one engine or catalog site simultaneously. Services in this group do not maintain their own indexes, but use those built by the engines or catalogs. No two meta searchers work alike, but attractive features at some include speedy searching and the ability to weed out duplicates. Disadvantages include the inability to conduct a complete search of the underlying indexes, and the lack of complex features like field restrictions, or even simple capabilities like phrase or proximity searching.[1]

Various factors influence how search services, specifically catalogs and engines, collect data. The first, and possibly the most prevalent method of data collection at this time, involves URL submissions. Webmasters, either directly or indirectly through software or services, submit information about their sites to the catalogs and engines. Performed regularly and appropriately, without the use of spam techniques, this method helps search services maintain fresh data.

Another collection technique involves robot exploration. Used by engines, this method in theory entails connecting to a specific URL and collecting data from the page. Then, by following all the links on that page and on its sub-pages, the engine gathers information about the entire site or domain. In practice, robot exploration often proves shallow.[2]
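
For the technically curious, the rough Python sketch below suggests how this kind of link-following works; the starting URL, the page limit, and the decision to stay within a single domain are illustrative assumptions, not a description of any particular engine's software.

```python
# A minimal sketch of robot (spider) exploration: start at one URL,
# collect the page, then follow its links to gather the rest of the
# site. Real engines add politeness delays, robots.txt checks, and far
# more robust parsing; the URL and limits here are assumptions.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=20):
    """Breadth-first exploration of one site, starting from start_url."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = [start_url], set(), {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                      # unreachable page: skip it
        pages[url] = html                 # "index" the page (here: just store it)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:   # stay within the site
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    # Hypothetical starting point; substitute any site you wish to explore.
    collected = crawl("http://www.example.com/")
    print(len(collected), "pages gathered")
```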

Many engines collect data only from top-level pages. Some visit sites too infrequently to maintain fresh information. Others add data only after determining the “popularity” of a site. “Popularity” refers to the number of sites linking to a specific domain.

According to Search Engine Watch, HotBot, Lycos and WebCrawler base collection decisions in part on a site’s popularity. Of course, to determine popularity, the engine must know about the linking sites.[3]
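
The calculation behind "popularity" is simple enough to sketch. The Python fragment below counts how many distinct sites link to each domain; the link data is invented for illustration, and real engines apply their own undisclosed thresholds.

```python
# A sketch of the "popularity" idea: count how many distinct sites link
# to each domain. The (linking page, linked-to page) pairs are invented.
from collections import defaultdict
from urllib.parse import urlparse

links = [
    ("http://www.siteA.com/index.html", "http://www.example.org/ada.html"),
    ("http://www.siteB.com/resources.html", "http://www.example.org/ada.html"),
    ("http://www.siteC.com/law.html", "http://www.other.org/brief.html"),
]

inbound = defaultdict(set)
for source, target in links:
    inbound[urlparse(target).netloc].add(urlparse(source).netloc)

for domain, sources in sorted(inbound.items(), key=lambda x: -len(x[1])):
    print(domain, "linked from", len(sources), "distinct sites")
```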

Moreover, various factors hinder engine data collection methods. Sites using frames, image maps to display content, or re-direct commands, for example, may unwittingly exclude some or all of their data from engines unable to handle these technologies.[4]

Engines also cannot gather data from password-protected sites. Other restrictions include the use of a robots.txt file or a robots meta tag.
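
The sketch below, again in Python and using the language's standard robots.txt parser, shows how a well-behaved robot checks these exclusion rules before indexing a page. The robots.txt contents, the user-agent name, and the URLs are hypothetical.

```python
# A sketch of how a well-behaved robot honors exclusion rules before
# fetching a page. The robots.txt policy and user-agent name shown here
# are hypothetical examples, not the rules of any actual site.
from urllib.robotparser import RobotFileParser

# A site might publish a robots.txt like the following to keep robots
# out of part (or all) of its content:
#
#   User-agent: *
#   Disallow: /members/
#
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # fetch and parse the rules

for page in ("http://www.example.com/articles/index.html",
             "http://www.example.com/members/brief.html"):
    if rp.can_fetch("ExampleBot", page):
        print("may index:", page)
    else:
        print("excluded by robots.txt:", page)

# The per-page equivalent is a robots meta tag in the document itself,
# e.g. <meta name="robots" content="noindex, nofollow">, which asks
# engines not to index the page or follow its links.
```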

Document types may further affect an engine’s ability to collect data. Most, for example, cannot explore word-processed files or desktop-publishing formats like Envoy (.evy) or portable document format (.pdf).

How do these hindrances affect the researcher? Consider sites that require passwords for access, or that use proprietary software to create or index documents. None of the content from these sites resides in an engine. Researchers using engines to find information, therefore, may miss materials like articles from The New York Times (password required), court opinions offered by Pennsylvania state court sites (word-processed documents), New Jersey statutes (Folio infobase), or legislation provided by GPO Access (re-direct commands and .pdf).

Data collection is only one of the factors influencing information retrieval. Another concern is the indexing method.

For example, does the search service use human indexers? If so, are they professional librarians who possess the necessary skills to categorize data?

With respect to engines, what data does the robot software index? Data contained within the title field? Meta tags? The first few words of the document? The entire document? It is not safe to assume that all engines index all data contained within a document.
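
To see what those choices mean in practice, the following Python sketch pulls the title field, the meta tags, the first few words, and the full text out of a single, invented HTML document; an engine that indexes only one of these fields "sees" correspondingly little of the page.

```python
# A sketch of the choices an engine faces when indexing a page: it may
# record only the title, the meta tags, the opening words, or the full
# text. The sample HTML below is invented for illustration.
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Pull out the title, meta tags, and visible text of a document."""
    def __init__(self):
        super().__init__()
        self.title, self.meta, self.text = "", {}, []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


sample = """<html><head><title>ADA Compliance Guide</title>
<meta name="keywords" content="ADA, disability, compliance">
<meta name="description" content="A guide to ADA compliance.">
</head><body><p>Employers must provide reasonable accommodation...</p>
</body></html>"""

parser = FieldExtractor()
parser.feed(sample)
body = " ".join(parser.text)
print("Title field only: ", parser.title)
print("Meta tags only:   ", parser.meta)
print("First few words:  ", " ".join(body.split()[:5]))
print("Entire document:  ", body)
```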

Moreover, even if an engine indexes everything, keyword research strategies are inherently fallible. Consider this example. Thomas, a Library of Congress site offering congressional documents, provides both keyword and subject term access to bill summaries. Subject term refers to access via the Legislative Indexing Vocabulary, a controlled thesaurus created by legislative analysts at the Library of Congress.

Connect to Thomas’ Bill Summary & Status database. Use option #1, the keyword feature, to search for the truncated term, encrypt*. This strategy locates seven (7) potentially relevant bill summaries. Now select option #2, the Legislative Indexing Vocabulary, and enter the recommended descriptor for encryption, cryptography. This method yields fourteen (14) bill summaries relating to the topic.[5]

By analogy, engines depend on this deficient keyword research method. Seeking information about euthanasia, researchers using engines may not consider related terms like “mercy killing,” “assisted suicide,” “death with dignity,” or “right to die.”
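
A toy example in Python makes the point. In the sketch below, a plain keyword search for "euthanasia" finds only one of three relevant documents, while a search expanded through a small, hypothetical thesaurus finds all three; the documents and the thesaurus are invented for illustration.

```python
# A toy illustration of why raw keyword matching misses relevant
# material: documents indexed under related terms do not contain the
# searcher's word. The documents and the small thesaurus are invented.
documents = {
    "doc1": "State legislature debates euthanasia bill.",
    "doc2": "Court weighs physician assisted suicide statute.",
    "doc3": "New right to die initiative gains support.",
}

thesaurus = {  # hypothetical controlled vocabulary for the topic
    "euthanasia": ["euthanasia", "mercy killing", "assisted suicide",
                   "death with dignity", "right to die"],
}


def keyword_search(term):
    return [d for d, text in documents.items() if term in text.lower()]


def vocabulary_search(term):
    related = thesaurus.get(term, [term])
    return [d for d, text in documents.items()
            if any(r in text.lower() for r in related)]


print("keyword search:   ", keyword_search("euthanasia"))     # finds doc1 only
print("vocabulary search:", vocabulary_search("euthanasia"))  # finds all three
```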

In addition to recognizing the imperfection of indexing and data collection methods, researchers should understand how search services rank results. Methods include ranking by the presence of search terms in key fields like the title or URL, the frequency of search terms throughout a document, the availability of a review, or even bidding.
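
No service publishes its exact formula, but the flavor of such ranking can be sketched. The Python fragment below scores documents by the presence of a term in the title and URL and by its frequency in the text, then sorts the hit list; the weights and the sample documents are assumptions chosen purely for illustration.

```python
# A toy relevance-ranking sketch: score each document by whether the
# query term appears in its title or URL and by how often it appears in
# the body, then sort the "hit list" by that score. The weights and the
# sample documents are invented; no real engine discloses its formula.
def score(term, doc):
    term = term.lower()
    points = 0
    points += 5 * (term in doc["title"].lower())        # presence in the title field
    points += 3 * (term in doc["url"].lower())          # presence in the URL
    points += doc["body"].lower().split().count(term)   # term frequency in the text
    return points


documents = [
    {"url": "http://www.example.com/ada-compliance.html",
     "title": "ADA Compliance Checklist",
     "body": "Compliance with the ADA requires reasonable accommodation..."},
    {"url": "http://www.example.com/news.html",
     "title": "Firm News",
     "body": "Our attorneys spoke on compliance issues last week."},
]

query = "compliance"
for doc in sorted(documents, key=lambda d: score(query, d), reverse=True):
    print(score(query, doc), doc["url"])
```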

Search Engine Watch indicates that both HotBot and Infoseek rank by the appearance of search terms in a document’s meta tag field.[6] This gives control to Webmasters, who may or may not use such power, or apply it appropriately or wisely.

WebCrawler, on the other hand, ranks by a combination of site popularity and occurrence of search terms in a document title. Excite gives precedence to resources it reviews.

Perhaps most disturbing, GoTo.com, a newcomer created by idealab!, ranks sites by the amount they pay for their placement. The higher the bid for placement, the higher the site appears in a hit list.

As if data gathering, indexing and relevancy ranking methods lacked sufficient impact on information retrieval, researchers also need to heed the effect of spamming. One form of spamming occurs when Webmasters, through the use of meta tags, assign popular but irrelevant keywords to a document. Common terms used for this purpose include “sex” or other sexually explicit language. A similar technique involves placing terms in the meta tag field that describe a competitor’s Web site.[7]

Another spamming method involves repeating terms in a meta tag field. I came across a political site one day that repeated “campaign” 31 times using this method. Still another tactic entails hiding terms within a document. Webmasters accomplish this by making the font color of the hidden words the same as the color of the document background.
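
Engines, for their part, try to detect such tricks. The Python sketch below flags two of the techniques just described: excessive repetition in the meta keywords field, and text whose font color matches the page background. The sample HTML and the repetition threshold are invented, and real engines rely on more sophisticated, undisclosed tests.

```python
# A sketch of the kind of check an engine might run to catch the
# spamming tricks described above. The HTML sample and the threshold
# are invented for illustration.
import re
from collections import Counter

sample = """<html><head>
<meta name="keywords" content="campaign, campaign, campaign, campaign,
campaign, campaign, campaign, campaign, vote, election">
</head>
<body bgcolor="#ffffff">
<font color="#ffffff">campaign campaign campaign campaign</font>
<p>Read about our candidate's platform.</p>
</body></html>"""

# Flag meta keywords repeated more than a few times.
keywords = re.search(r'name="keywords" content="([^"]*)"', sample)
if keywords:
    counts = Counter(k.strip().lower() for k in keywords.group(1).split(","))
    for word, n in counts.items():
        if n > 3:                      # threshold is an arbitrary assumption
            print(f"possible keyword spam: '{word}' repeated {n} times")

# Flag text whose font color matches the document background.
background = re.search(r'bgcolor="(#\w+)"', sample)
if background and re.search(rf'<font color="{background.group(1)}"', sample):
    print("possible hidden text: font color matches background color")
```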

After delving into the intricacies of search services, and especially those of engines, readers understandably might conclude they should avoid them at all costs. This is not the case. Search services often provide an effective means for locating information. It is important, however, to understand their capabilities and limitations. Consider other methods for discovering information as well.[8] But when other strategies fail, search services may provide the only means by which to find an answer.

********************

Stay tuned. Next month, Diana Botluk, ResearchWire’s new co-author, will examine and compare the features of major search services.

********************

Footnotes

  1. Inference Find offers phrase searching, but note the limitations.

  2. For more information about the exploration practices of specific search engines, see Search Engine EKGs by Search Engine Watch at URL http://www.searchenginewatch.com/reports/ekgs/.

  3. See Search Engine Features Chart, Search Engine Watch, 31 March 1998, at URL http://searchenginewatch.com/webmasters/features.html.

  4. For more information about these technologies, see the University of Virginia Library’s Short Course Handouts Online at URL http://poe.acc.virginia.edu/~libinstr/handouts.html.

  5. Research performed April 3, 1998.

  6. Supra, note 2.

  7. See, for example, Oppedahl & Larson v. Advanced Concepts, where the defendant allegedly used “Oppedahl” and “Larson” in the meta tag field of some of its Web pages.

  8. See Flashback! Employing Traditional Research Techniques on the Web, Law Library Resource Xchange, 16 December 1996, at URL http://www.llrx.com/columns/flash.htm.
