ResearchWire - High Hopes for Newer Search Technologies
By Genie Tyburski, Published on September 1, 1999
Genie Tyburski is the Research Librarian for Ballard Spahr Andrews & Ingersoll, LLP and the Web Manager of The Virtual Chase: A Research Site for Legal Professionals.
What about those search engines? Authors Lawrence and Giles have stirred up a pot of competition by updating and expanding on their earlier search engine research.1 Already both FAST and Excite claim improved technologies for data collection while Lycos asserts SeeMore will find information related to words, phrases or images "on any page on the Internet" (emphasis added).2 Are these useful improvements? What newer technologies exist with potential to enhance our searching of the Web?
To summarize, Lawrence and Giles report:
- The Web stores approximately 800 million pages available for indexing. This equates to six terabytes of text minus the HTML coding.
- Of the phenomenal number of pages available for indexing, no single engine covers more than about 16%.
- Indexing new or modified pages takes months.
- The more popular a page, the more likely engines will index it.
- Only 34% of Web sites use meta tag coding.
- Little overlap of data exists between the engines.
What do these discoveries mean for researchers? To begin with, they illustrate an engine's inability to index all the Web. Note the study emphasizes "publicly indexable" pages; that is, those unique URLs available to search engines.
According to Hobbes' Internet Timeline, the Web hosts 4,389,131 servers as of March 1999. A server houses one or more Web pages. Think about it. The 800 million unique URLs to which the study refers represent a small piece of the Web.
What's missing? To start with, password-restricted free data like the Science Magazine abstract linked in footnote number one or articles appearing in The New York Times or the Financial Times. Absent too are password-restricted commercial data like articles appearing in the Wall Street Journal or Consumer Reports, or public records stored on KnowX or cdb4web. Many search engines also fail to index dynamically delivered data like congressional documents housed at GPO Access, New Jersey statutes, or the codes and rules of professional conduct available from Cornell's American Legal Ethics Library. Moreover, some engines cannot collect data contained within proprietary files like portable document format (.pdf), Word or WordPerfect, image maps or frames.
The authors suggest using meta search services like MetaCrawler to improve coverage of the Web. I disagree. While using sites like MetaCrawler may offer more comprehensive coverage than that of a single search engine, researchers who must search far and wide will do better to search each engine individually. Meta search services often cut off queries before they complete the search of an index in order to increase their speed. Many times, they provide inadequate query translations -- or none at all. Consequently, results may misrepresent the actual data available.
The finding that only 34% of Web sites use meta tag coding presents another dilemma for researchers. The authors write: "The low usage of the simple HTML metadata standard suggests that acceptance and widespread use of more complex standards, such as XML may be slow." This observation means researchers may not see the great potential of XML to deliver the Web as a database in the near future.
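For readers unfamiliar with meta tag coding, the following minimal sketch shows the kind of information an engine could read from a page's header. The sample page and the names `MetaTagReader` and `read_meta_tags` are hypothetical, written for illustration with Python's standard `html.parser` module. They do not describe how any particular engine processes pages.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the description and keywords are invented.
SAMPLE_PAGE = """
<html><head>
<title>School Safety Resources</title>
<meta name="description" content="Reports on youth violence and school discipline.">
<meta name="keywords" content="school safety, youth violence, risk factors">
</head><body><p>Page text.</p></body></html>
"""


class MetaTagReader(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags that lack either a name or a content attribute.
        if name and content:
            self.meta[name.lower()] = content


def read_meta_tags(html_text):
    """Return a dictionary of the named meta tags found in html_text."""
    reader = MetaTagReader()
    reader.feed(html_text)
    return reader.meta


if __name__ == "__main__":
    for name, content in read_meta_tags(SAMPLE_PAGE).items():
        print(f"{name}: {content}")
```

A page without such tags returns an empty dictionary, which, according to the study, is the situation at roughly two-thirds of Web sites.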
I mentioned that newcomer FAST recently announced it now indexes "more than 200 million unique URLs." This coverage places it ahead of Northern Light, the engine the study cited as reaching 16% of the Web. FAST's press release states that it expects to index the entire Web in one year's time and then keep it up to date.
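The comparison rests on simple arithmetic, sketched below using only the figures quoted in this article. It assumes the study's estimate of 800 million indexable pages still held when FAST made its announcement. The Web was growing, so the true percentage is probably somewhat lower. The function name `coverage` is my own.

```python
# Figures quoted in the article (Lawrence and Giles estimate; FAST announcement).
INDEXABLE_PAGES = 800_000_000
BEST_COVERAGE_IN_STUDY = 0.16      # Northern Light, per the study
FAST_INDEXED_PAGES = 200_000_000   # "more than 200 million unique URLs"


def coverage(indexed_pages, total_pages=INDEXABLE_PAGES):
    """Return the fraction of the indexable Web that an index covers."""
    return indexed_pages / total_pages


# Number of pages implied by the study's best coverage figure.
northern_light_pages = BEST_COVERAGE_IN_STUDY * INDEXABLE_PAGES

# FAST's coverage, assuming the Web has not grown since the study.
fast_coverage = coverage(FAST_INDEXED_PAGES)

if __name__ == "__main__":
    print(f"Pages at 16% coverage: {northern_light_pages:,.0f}")
    print(f"FAST coverage: {fast_coverage:.0%}")
```

Under that assumption, 16% coverage amounts to about 128 million pages, while 200 million pages amounts to about 25%.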
Despite these pronouncements, search engines will continue to face hurdles like password-restrictions, databases, proprietary files and other technologies in their data collection efforts. Any attempt to increase coverage, however, is a step in the right direction, albeit a tiny one. As Danny Sullivan points out, sheer scope of coverage without attention to the relevancy of information to a query does little to improve search technology.4
For sheer scope of coverage, not to mention processing speed as its name implies, FAST takes the lead. But anyone who searches FAST will soon discover its weakness -- relevancy. For example, a search for the Long Beach Unified School District does not find a link to the home page within the first page of results. A search for Science Magazine returns links for Science Fiction, Fantasy, Weird Fiction Magazine Index, Science Humor Magazine, and other unrelated items. A search for the laws of Northern Ireland finds news items, a white paper, and constitutional proposals within the first page of hits, but not Her Majesty's Stationery Office, which publishes these laws.
On the other hand, separate queries for school uniforms student behavior and risk factors youth violence found several useful documents, many of which appeared within the first page of hits.
Clearly FAST, although speedy and far-reaching, requires work to be of consistent use to professional researchers. But its ability to index great quantities of Web pages, and to search them rapidly, adds to the current state of affairs on the Web.
Where then might researchers look for technological developments emphasizing relevancy? Currently, Google, Open Directory, Direct Hit and CLEVER warrant mention. The latter, however, is not available for public viewing.6
Believing in popularity as a relevancy indicator, Direct Hit monitors Web sites researchers select from a hit list. It factors in statistics like time spent at a selected site and then applies this information to refine the engine's index. If searchers frequently select site X in response to query Y, Direct Hit boosts site X's relevancy ranking.
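The idea can be sketched in a few lines. Direct Hit's actual formula is not described here, so everything below is an assumption made for illustration: the click log, the ten-second threshold, the scoring by total time spent, and the function names.

```python
from collections import defaultdict

# Hypothetical click log: (query, site selected, seconds spent at the site).
CLICK_LOG = [
    ("science magazine", "sciencemag.org", 240),
    ("science magazine", "sciencemag.org", 180),
    ("science magazine", "science-humor.example", 5),
    ("science magazine", "sf-index.example", 12),
]


def popularity_scores(click_log, query, min_seconds=10):
    """Total the time spent at each site on visits from this query.

    Visits shorter than min_seconds are ignored, on the assumption that
    the searcher left quickly because the site did not help.
    """
    scores = defaultdict(int)
    for logged_query, site, seconds in click_log:
        if logged_query == query and seconds >= min_seconds:
            scores[site] += seconds
    return dict(scores)


def rerank(hit_list, click_log, query):
    """Reorder a hit list so that sites with higher scores come first."""
    scores = popularity_scores(click_log, query)
    # sorted() is stable, so sites with equal scores keep their original order.
    return sorted(hit_list, key=lambda site: scores.get(site, 0), reverse=True)


if __name__ == "__main__":
    original = ["sf-index.example", "science-humor.example", "sciencemag.org"]
    print(rerank(original, CLICK_LOG, "science magazine"))
```

In this toy log, the site that earlier searchers chose and stayed at rises to the top of the list for the same query.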
Unfortunately, success with Direct Hit depends on the choices of earlier searchers, who may or may not have the same information needs -- including subject familiarity and intended use -- as a current researcher. To test it, I ran the queries above at Hotbot and MSN Search. Although results varied somewhat, I immediately found Long Beach Unified School District and Science Magazine. The search on risk factors youth violence yielded several helpful publications, while the query for school uniforms student behavior produced mediocre results. Finally, the search for the laws of Northern Ireland bombed at MSN Search, failing to find anything of relevance. It also performed poorly at Hotbot, finding only one indirect link.
Formerly NewHoo, Open Directory asks "experts" to index and annotate resources in their area of expertise.7 It offers a category on School Safety, for example, that links to several potentially relevant resources.
Researchers should note that although many popular engines use Open Directory, search results may vary for several reasons. Licensees, for example, may elect to use only some of the Open Directory data. Further, they may rank it according to their own definitions of relevancy.8
Vastly different from other search technologies, excepting possibly CLEVER, Google applies complex algorithms to obtain relevancy. The algorithms in part resemble the concept of citation analysis first employed by Science Citation Index.9 That is, Google follows the basic principle that works cited by a document offer potentially relevant information. Google's creators further apply a number of weights to deal with certain peculiarities of the Web like universally popular sites, competitive sites, or peer-reviewed publications.10
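To give a feel for link-based ranking, here is a minimal sketch in which each page passes a share of its score to the pages it links to, much as a cited work gains standing from the documents citing it. The iteration is a simplified form of the PageRank calculation published by Google's creators. It is not the engine's actual implementation and omits the additional weights mentioned above. The tiny link graph, the damping value of 0.85, and the function name `link_rank` are assumptions for illustration.

```python
# Hypothetical link graph: each page maps to the pages it links to ("cites").
# Every linked page must also appear as a key in the dictionary.
LINKS = {
    "district-home": ["state-dept"],
    "news-story": ["district-home"],
    "parent-group": ["district-home", "news-story"],
    "state-dept": ["district-home"],
}


def link_rank(links, damping=0.85, iterations=50):
    """Iteratively share each page's score among the pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            if not targets:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
                continue
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


if __name__ == "__main__":
    ranks = link_rank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(f"{page}: {ranks[page]:.3f}")
```

In this graph the page with the most incoming links, `district-home`, ends up with the highest score, and the page nobody links to, `parent-group`, ends up with the lowest.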
The outcome provides researchers with the first engine to emphasize search precision over recall. Try the searches above at Google. The home page of the Long Beach Unified School District appears as the first result when entering long beach unified school district. Science Magazine pops up as the second hit. A search for the laws of Northern Ireland performs less well: Her Majesty's Stationery Office appears eighth on the hit list!
Queries for school uniforms student behavior and risk factors youth violence also yield highly relevant documents.
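For readers new to the terms, precision is the share of retrieved pages that are relevant, while recall is the share of all relevant pages that are retrieved. The sketch below computes both for two invented hit lists. The page names and counts are hypothetical and do not come from the searches described above.

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved pages against relevant pages."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 10 pages on the Web are relevant to the query.
relevant_pages = {f"relevant-{n}" for n in range(10)}

# A precision-oriented engine returns few hits, most of them relevant.
focused_hits = ["relevant-0", "relevant-1", "relevant-2", "off-topic-0"]

# A recall-oriented engine returns many hits, most of them off topic.
broad_hits = [f"relevant-{n}" for n in range(8)] + [
    f"off-topic-{n}" for n in range(32)
]

if __name__ == "__main__":
    for label, hit_list in (("focused", focused_hits), ("broad", broad_hits)):
        precision, recall = precision_and_recall(hit_list, relevant_pages)
        print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

The focused list finds fewer of the relevant pages but wastes little of the researcher's time. The broad list finds most of them but buries them among off-topic hits.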
Hours before the deadline for this article, Surfwax, a new meta search service, arrives on the scene. Usually, and for reasons I mention above, I avoid meta searchers. In this case, I make an exception.
Developed with technologies that define words and word relationships, Surfwax improves upon existing meta searchers. Enter a query in the search box. Then watch as two frames appear below it. Loading first, the left-hand frame contains results from various search services including FAST and Google.
What distinguishes Surfwax is a small, easily overlooked light green icon that sometimes appears to the left of linked hits. Clicking on this icon loads abstracts, key points and buzzwords from the matching resource in the right-hand frame. Researchers may then select one or more of the buzzwords to refine their searching.11
Legal professionals familiar with Lexis' Core Terms will recognize this searching strategy. To illustrate, enter youth risk factors violence in the query box. Then click on the light green icon that appears next to the matching link, "Youth Violence -- National Center for Injury Prevention ...." Now observe the key points and buzzwords that appear in the right-hand frame.
To focus the query, review the buzzwords and then click on the red icon next to one of them. (If you use Netscape, click on the buzzword hyperlink instead.) I selected school discipline. It may appear that nothing happens, but review your search statement, which still appears at the top of the screen. Notice it now contains the phrase "school discipline" in addition to the original query. Run the new search to find resources discussing the effect of school discipline policies on youth violence.
As experienced online researchers know, the success of a query depends in part on the search statement. Amateur chefs or frantic spouses looking for recipe suggestions for leftover turkey want to enter more than just the word turkey. Imagine the reaction of a librarian if a patron approached with just this one word. Yeah, you too, buddy!
Search engines, too, require context, or at least well-defined queries. Yet as illustrated, even straightforward search statements may not produce desired results. What is a poor researcher to do?
First, to locate a starting point or an answer as opposed to all possible answers, begin with an engine like Google that emphasizes relevancy over data collection. Google works so well for this particular research strategy that, in a recent workshop on research strategies for finding government information on the Internet, I had difficulty steering students away from Google U.S. Government Search to examine other techniques.12 Second, for comprehensive research like scouring the Web for use of common law trademarks or discovering competitive intelligence, begin with an enhanced meta search service like Surfwax. Then if necessary, and to accomplish exhaustive searching of the Web to the extent available, query every existing engine separately.
Sound like a lot of work? You bet! The current state of technology demands attention to the limitations of "searching the Web." Yet the development of new services like FAST, Google and Surfwax, as well as new technologies like XML, lends hope for the not-too-distant future.
"Accessibility of Information on the Web," 400 Nature 107 (Jul. 8, 1999). Earlier research published as "Searching the World Wide Web," 280 Science Magazine 98 (Apr. 3, 1998). <back to text>
Search Engine Update, no. 58 (Aug. 2, 1999). Available to paid subscribers at URL http://searchenginewatch.com/subscribers/updates/currentsu.html.
The August 2, 1999 issue of Search Engine Watch details the editorial process and its system of checks and balances. Paid subscribers will find it at URL http://searchenginewatch.com/subscribers/updates/currentsu.html.
Sullivan, Danny. How the Open Directory Works (updated July 7, 1999). Available to paid subscribers at URL http://searchenginewatch.com/subscribers/opendirectory.html.
"Hypersearching the Web," Scientific American (June 1999). Available at URL http://www.sciam.com/1999/0699issue/0699raghavan.html.
Ibid.
Surfwax offers no information about the origin of these buzzwords, called FocusWords.