Features - Mining Deeper into the Invisible Web

Diana Botluk is a reference librarian at the Judge Kathryn J. DuFour Law Library at the Catholic University of America in Washington, D.C., and is the author of The Legal List: Research on the Internet.  She teaches legal research at CAPCON, Catholic University Law School, and the University of Maryland.

Let me tell you a story. A friend’s son had a homework assignment to use the Internet to find out the amount of the 1998 blue crab harvest from the Chesapeake Bay. She worked with her son, searching online for hours, to try to pinpoint this information using general search engines like Google, but came up empty. She called me, and I managed to pull up the figure she needed in just a few minutes.

I’m not special. Any competent librarian could have done the same thing. And my friend is very Internet-savvy and comfortable maneuvering in an online environment. Despite this, she wasn’t able to find the answer. Her story is similar to those I’ve heard over and over again from people who are online all the time. People who don’t do this for a living are under the impression that a search in a general search engine will open up the world of information and very simply retrieve the data they need. Then, when it doesn't work, they get frustrated.

Searching for information online is deceptive. General search engines make it seem like it’s supposed to be easy. And sometimes it is easy. You type in american bar association and presto! There’s the link to the home page for the American Bar Association. A few simple successes like that can lull people into a false sense of security, making them believe that it should always be that easy, and that when it’s not, the information simply isn’t online.

There are a lot of uses for the Internet, and one of them is to make information available to the public. In that way, the Internet can be compared to a library. Sometimes a person can walk into the library, type american bar association into the library catalog, and retrieve the books in the library about the American Bar Association. That isn’t hard, and they don’t need help to do it.

Other times, though, locating the information may be a little more difficult. Sure, the library probably has a book or periodical that will provide data on blue crab harvests in the Chesapeake Bay. But unless the book is called something like “Blue Crab Harvests in the Chesapeake Bay”, simply using the catalog may not be enough. At that point the researcher does what any good library patron would do…she asks the reference librarian. The librarian is quickly able to point her in the direction of those books in the library that are likely to contain that information.

The Extent of the "Deep Web"

The Internet is full of places to locate useful information. Much of the information sits on open web pages, able to be reached by general search engines. But a great deal more of the information sits in databases that general search engines simply cannot reach. It’s not that the information is really hidden or invisible. It’s there, freely available and waiting to be found. The problem is that general search engines are built in such a way that they just can’t go into every single database and search the information contained in each one.

So, the databases online are a lot like the books in the library. They contain the relevant information, but you need to know where to look. Think about it for a second. This is really second nature to all of us to one extent or another. If you wanted to use the library to locate the addresses of bookstores in Annapolis, you wouldn’t type bookstores in annapolis into the library catalog. You’d open the Annapolis yellow pages. Most of the useful information on the Internet is the same way.

In fact, The "Deep" Web: Surfacing Hidden Value, a recent study by Internet company BrightPlanet, estimates there are more than 100,000 searchable databases, containing 550 billion individual documents, available on the web. The contents of these databases are not searchable using general search engines. They estimate that this content is 500 times larger than what is available on the “surface web”, or those pages that a general search engine is capable of searching. 95% of this information is available without subscription or fees. See BrightPlanet Unveils the "Deep" Web: 500 Times Larger Than the Existing Web.

Furthermore, as discussed in LLRX.com’s ResearchWire last year, a February 1999 study in Nature revealed that general search engines, at best, are reaching only 16% of the surface web, what they call the “publicly indexable web” (Steve Lawrence and Lee Giles, Accessibility and Distribution of Information on the Web, a summary of the study Accessibility of information on the web, Nature, Vol. 400, pp. 107-109, 1999). This figure doesn’t even take the invisible or deep web into account. In October 2000, Fast Search and Transfer announced that FAST Search contains 550 million searchable pages. See FAST Announces World’s Largest Search Engine, October 12, 2000. A quick look at Danny Sullivan’s Search Engine Watch shows that his latest size comparison, from June 6, 2000, puts Google at around 560 million pages and WebTop.com at around 500 million. Google, WebTop.com and Inktomi all claim to cover just about half of the indexable web, with the estimate of the indexable web being about a billion individual pages.

The BrightPlanet study tells us that the web really contains about 550 billion individual documents, or pages, when you include all those pages that are not indexable, or reachable by general search engines. Thus, for every one page a search engine could theoretically reach, there are 549 more pages out there with useful information on them that cannot be reached by search engines. And when you’re talking in terms of billions, that’s an awful lot of information search engines are missing.
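The arithmetic behind that ratio can be checked with a short sketch. It assumes, as the paragraph above does, that 550 billion is the size of the whole web and that the indexable web is about one billion pages; the constant and function names are made up for illustration, and the figures are the rough estimates quoted in this article, not measurements.

```python
# Rough estimates quoted in the article; treat them as assumptions, not measurements.
TOTAL_DOCUMENTS = 550_000_000_000  # all documents, indexable or not
INDEXABLE_PAGES = 1_000_000_000    # pages a general search engine could theoretically reach
LARGEST_INDEX = 560_000_000        # roughly the largest index size cited (June 2000)


def unreachable_per_reachable(total: int, indexable: int) -> float:
    """Pages out of reach for each page a search engine could theoretically index."""
    return (total - indexable) / indexable


def share_of_web(index_size: int, total: int) -> float:
    """Fraction of all documents covered by an index of the given size."""
    return index_size / total


if __name__ == "__main__":
    print(unreachable_per_reachable(TOTAL_DOCUMENTS, INDEXABLE_PAGES))
    print(f"{share_of_web(LARGEST_INDEX, TOTAL_DOCUMENTS):.2%}")
```

Under these assumptions, even the largest index mentioned above would cover only about a tenth of one percent of the documents on the web.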

No wonder people are getting frustrated. So the next question becomes “how do we deal with this?”

For starters, web portal sites, often the same place you find general search engines, are designed to try to get their users to head in the right direction when they seek information. For example, when seeking information about an individual person, such as their phone number or e-mail address, we know to go to online white pages. Most portal sites have obvious direct links to these white-pages-type databases right on their front pages.

Search engines themselves have added extra features to their results screens. In addition to regular web results, search engines suggest different databases where the answer might be found. Depending on the search terms, these suggestions vary in real usefulness. A great example of this is AltaVista. When I searched for 1998 chesapeake bay blue crab harvest at AltaVista, I retrieved some web site links that might have ultimately pointed me to the figures I sought had I done some further digging. AltaVista also made some suggestions of where else to look, but most of them had to do with shopping for 1998 Chesapeake Bay blue crabs. I was also directed to the video game, Harvest Moon 64, and Neil Young’s Harvest Moon. In other words, the search engine, hard as it tried, could only examine the string of characters I gave it, and didn’t really “understand” the subject I was searching for. Nor did I expect it to.

Search engines don’t “understand” what you ask for. They simply look at the characters you type in and try to match them to pages they find. Therefore, responsibility for the search is on the researcher herself. If she understands the research process, then unlocking the information found in deep web databases becomes easy.

Thus, research on the web should be viewed as a two-step process. The first step is figuring out where to look for the data. This is the hard part, but once you locate the right database, finding the specific data should be a breeze.

General search engines can be used as a tool to try to locate databases. For example, when I used AltaVista to try to locate the crab harvest data, I found many good sites related to the Chesapeake Bay, or crabs, or both. But I had to spend some time browsing to reach the place where I could search through the actual harvest data to find my specific answer.

Directories to Database Resources on the Web

A few other sites on the web have been created to make this search for databases easier. One is InvisibleWeb.com, a directory of over 10,000 databases, archives, and search engines that contain information which traditional search engines have been unable to access. It can be searched or browsed by category, just like the larger Yahoo! directory. An InvisibleWeb.com search for 1998 chesapeake bay blue crab harvest yielded no usable results. That raises the point that in step one of the research process, locating a database in which to search, a researcher should broaden the search terms a little. Think about what the database is likely to be called or how it might be described. Then, once the database is located, a search should be performed using more specific terms. I might have had more luck looking for something like fisheries statistics. Unfortunately, even though InvisibleWeb.com lists over 10,000 useful databases, it didn’t list one that was right for my research. Thus, I moved on to my next tool.

Direct search is a list of databases compiled by Gary Price of George Washington University. This directory appears to limit itself to databases containing more serious or scholarly information, omitting those that exist purely for entertainment value, which is good news for those seeking serious information. Browsing through the list, I found something called FishBase, but when I went there, it was more a database for biological information about fish, and not fishing statistics. I decided to try something else.

Refdesk.com looks at the Internet as a great big library of information, but without organization. This site tries to bring some order to the chaos, finding what useful information is out there, and presenting links to that information in a logical order. This is one of the most useful sites on the Internet for serious researchers. While some of the links are “too alphabetical” and could be subcategorized better, that problem is overcome by the fact that the site is easily searchable. While Refdesk.com has a separate category specifically called “Databases”, the entire site can be seen as a guide to locating the place from which to start your search for information. Still, it wasn’t exactly right for this particular question, so I kept plugging away.

Remember BrightPlanet, the company that did the study on the deep web? BrightPlanet is an Internet company and there was a reason for their research and study. BrightPlanet has two products designed to help researchers dig deeper into the invisible web. The first is CompletePlanet. CompletePlanet is an online directory which links to over 17,000 databases available on the web. The entire directory can be searched, and some of the database links can be found by browsing subject categories. Looking for crabs or crabbing here didn’t help me locate a good database. But when I thought of the problem in terms of fisheries statistics, I found a link to the NOAA Fisheries page, bringing me to the National Marine Fisheries Service web site. A quick point and click on “latest catch information” brought me to the database where I could find the information I was seeking. From here, I was able to maneuver my way through the database to locate the exact data for Chesapeake Bay crab harvests in 1998.
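Step one of that process, finding the right database by describing it rather than the answer, can be illustrated with a minimal sketch. The three-entry directory below is a hypothetical stand-in for a site like CompletePlanet: the entry descriptions are invented for illustration, and the word-matching rule is an assumption, not how any real directory works.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DatabaseEntry:
    name: str
    description: str  # hypothetical wording, not an actual directory listing


# Illustrative stand-in for a database directory (entries are assumptions).
DIRECTORY = [
    DatabaseEntry("FishBase", "biological information about fish species"),
    DatabaseEntry("NOAA Fisheries", "fisheries statistics and latest catch information"),
    DatabaseEntry("Online white pages", "phone numbers and e-mail addresses of individuals"),
]


def find_databases(query: str, directory: list[DatabaseEntry]) -> list[str]:
    """Return names of entries whose name or description contains every query word."""
    words = query.lower().split()
    hits = []
    for entry in directory:
        text = f"{entry.name} {entry.description}".lower()
        if all(word in text for word in words):
            hits.append(entry.name)
    return hits


if __name__ == "__main__":
    # The very specific question finds no database at all...
    print(find_databases("1998 chesapeake bay blue crab harvest", DIRECTORY))
    # ...while a description of the kind of database needed finds one.
    print(find_databases("fisheries statistics", DIRECTORY))
```

The specific terms only become useful in step two, once you are searching inside the database itself.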

BrightPlanet’s second product designed to expose the deep web is a piece of software called LexiBot. A researcher would download LexiBot onto their own computer and use it to search for a string of keywords. LexiBot then enters hundreds of databases and searches the keywords in each one. It is all accomplished automatically, and it takes what seems like forever and a day. It should be used in the background when you have other work to do. Don’t sit there and wait for the results, like you would with a regular search engine.

At the moment, LexiBot is capable of reaching about 600 databases, and may be running your search concurrently in up to 60 of them at any given time. You can input a simple search and let it run, or you can be a little picky about the databases, choosing them by category or individually, decreasing the number of databases LexiBot searches through and the amount of time the search will take. I decided to try my search using LexiBot.
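The idea of working through hundreds of databases with only a limited number of searches running at once can be sketched as follows. This is not LexiBot's implementation, which is not described here; it is a generic illustration using Python's standard thread pool, and `search_database` is a hypothetical stub that returns no hits where a real tool would query a remote database.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 60   # ceiling on simultaneous searches, as described above
DATABASE_COUNT = 600  # approximate number of reachable databases, as described above


def search_database(db_name: str, query: str) -> tuple[str, list[str]]:
    """Hypothetical stand-in for querying one database; this stub finds nothing."""
    return db_name, []


def search_all(databases: list[str], query: str) -> dict[str, list[str]]:
    """Run the query against every database, at most MAX_CONCURRENT at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        results = pool.map(lambda db: search_database(db, query), databases)
        return dict(results)


if __name__ == "__main__":
    names = [f"database-{i}" for i in range(DATABASE_COUNT)]
    hits = search_all(names, "blue crab harvest")
    print(len(hits))
```

Passing a shorter list of database names is the equivalent of choosing databases by category or individually: fewer databases to visit means less time spent waiting.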

I downloaded the software and searched 1998 AND “chesapeake bay” AND “blue crab” AND harvest. I had no luck with that, but I couldn’t tell whether it was the software, the search itself, my computer, my Internet connection, or whatever else might be wrong. I decided to try an easier one, so I looked for House Report 106-554 about the Wildlife and Sport Fish Restoration Programs Improvement Act of 2000. I know that committee reports are in databases at Thomas and GPO Access, but are not easily searchable with general search engines.

I typed the search string “sport fish restoration programs improvement” AND 106-554 into LexiBot and clicked on search. It started worming its way through dozens of databases, and after a while came back with a list of hits that included the committee report. LexiBot was not fast, but it did manage to find my document successfully. Of course, I knew in advance where the report would be, so ordinarily I wouldn’t have bothered to use LexiBot for this search.
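A search string of this kind, with quoted phrases joined by AND, can be read as a list of literal strings that must all appear in a document. The sketch below is one simple interpretation of that syntax, not LexiBot's actual query parser; the function names and the sample titles are assumptions made for illustration.

```python
from __future__ import annotations

import re


def parse_and_query(query: str) -> list[str]:
    """Split a query such as '1998 AND "blue crab"' into lowercase terms and phrases."""
    # Straighten curly quotes so both quote styles are treated alike.
    normalized = query.replace("\u201c", '"').replace("\u201d", '"')
    parts = re.split(r"\s+AND\s+", normalized.strip())
    return [part.strip().strip('"').lower() for part in parts if part.strip()]


def matches(query: str, text: str) -> bool:
    """True when every term or phrase appears literally in the text, ignoring case."""
    haystack = text.lower()
    return all(term in haystack for term in parse_and_query(query))


if __name__ == "__main__":
    report = "House Report 106-554, Wildlife and Sport Fish Restoration Programs Improvement Act of 2000"
    print(matches('"sport fish restoration programs improvement" AND 106-554', report))
    print(matches('1998 AND "chesapeake bay" AND "blue crab" AND harvest', report))
```

Matching of this sort is purely literal, which is the same limitation noted earlier: the software compares strings of characters and has no notion of the subject being researched.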

The bottom line is that when searching for information on the web, researchers often must go beyond typical search engines to locate what they need. We might try using tools like LexiBot or directories like InvisibleWeb.com, or browsing something like Refdesk.com, or simply digging a little deeper into topical sites that are likely to point to a database containing the answer. The point is that often the key to the answer is not locating the answer itself as the first step, but locating the right database in which to search for it. Researchers who bring this knowledge to a search, along with a little creativity and perseverance, will have the best luck locating what they need.

 

Web Sites Mentioned in this article:

Google

The "Deep" Web: Surfacing Hidden Value

BrightPlanet Unveils the "Deep" Web: 500 Times Larger Than the Existing Web

Accessibility of information on the web, Nature, Vol. 400, pp. 107-109, 1999

FAST Search

Search Engine Watch

WebTop.com

Inktomi

AltaVista

InvisibleWeb.com

Direct search

Refdesk.com

CompletePlanet

LexiBot
