Web Critic - The Internet Archives: Preserving the History of Web Pages

By Kathy Biehl, Published on November 1, 2001

Attorney and author Kathy Biehl practiced law privately in Houston, Texas for 18½ years before relocating to New York City in 1998. She has taught legal research and writing at the University of Houston Law Center and business law at Rice University. A member of the State Bar of Texas, she earned a B.A. with highest honors from Southern Methodist University and a J.D. with honors from the University of Texas School of Law, where she was a member of Texas Law Review and Order of the Coif. She is co-author of The Lawyer’s Guide to Internet Research (Scarecrow Press, Nov. 2000), with Tara Calishain.

Web Critic evaluates legal research Web sites in terms of the information they convey, how effectively they convey it and how well they take advantage of the possibilities of the Internet -- or don't.

How often have you gone to a web site and not been able to locate a resource
that was there on your last visit? The discovery can be frustrating, even
maddening when you're on a personal mission. During a research project, it can trigger delays and complications (not to mention doubts about your sanity or memory) as you scramble to locate what you need elsewhere, assuming it still exists, online or off.

It's a constant problem. Site redesigns and revisions cause all kinds of
information to vanish without a trace. In a physical library, outdated or
superseded volumes may go into storage, where they gather dust but remain available for reference. Their counterparts on the Web simply go away, often with no evidence that they ever existed, like an out-of-favor figure airbrushed from a Stalinist-era photograph.

The implications are profound, particularly for government information. Western Washington University librarian Rob Lopresti illustrated the problem, in a letter to Wired Magazine this past February, with the simple but powerful statement that his own site contained links to more than 70 federal government pages that had disappeared. Ironically, the URL in his letter is dead (and his current home page contains no mention of the issue), but the potential for information gaps remains. If the move, as seems to be occurring, is towards Web-based dissemination and away from print publications, huge holes will inevitably arise as Web pages change. 

Part of the solution is already underway. A tax-exempt organization called the
Internet Archive has assumed the mantle of rescuing public online data from oblivion. Its goal is no less than giving researchers, historians and scholars free and permanent access to public materials. 

Preserving government documents is part of its mission, but by no means all of it. The Archive's explanatory pages speak in terms of protecting cultural
artifacts, cultural heritage and our "right to remember." Academic or
theoretical uses are in the mix as well, such as tracing changes in language,
evaluating how the Web affects commerce, and investigating what the Web tells us about ourselves.

It's a lofty and potentially unwieldy mission, to be sure, but there are early
and significant signs of success. The Smithsonian Institution and the Library
of Congress have signed on as collaborators. More importantly, the Archive is
already staggeringly huge, boasting 10 billion Web pages dating back to 1996. (Some of the data is donated by Alexa Internet, which makes a navigational aid that displays contact info, site statistics and other details about Web sites as you visit them.) The FAQ describes the size as more than 100 terabytes of data, a measurement that rocketed right out of my frame of reference and into the realm of Ask Jeeves. He pointed me to the explanation that one terabyte is the equivalent of 1,000 gigabytes or 50,000 trees made into paper and printed. (You may find a more accessible perspective, as I did, in the statement that 100 terabytes is about 10 times the size of the Library of Congress' printed collection.)
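For readers who like to see the arithmetic, here is a minimal sketch that scales up the figures quoted above. The constant names and the `archive_scale` helper are made up for illustration, and the per-page average is only a rough guide, since the Archive also stores material other than Web pages.

```python
# Figures quoted in the column (Archive FAQ and Ask Jeeves, late 2001).
ARCHIVE_TERABYTES = 100          # "more than 100 terabytes", so a lower bound
GIGABYTES_PER_TERABYTE = 1000    # decimal units, as quoted
TREES_PER_TERABYTE = 50_000      # trees made into paper and printed
ARCHIVED_PAGES = 10_000_000_000  # "10 billion Web pages"


def archive_scale(terabytes=ARCHIVE_TERABYTES, pages=ARCHIVED_PAGES):
    """Convert the quoted size into gigabytes, trees and a rough KB-per-page figure."""
    gigabytes = terabytes * GIGABYTES_PER_TERABYTE
    trees = terabytes * TREES_PER_TERABYTE
    # 1 gigabyte = 1,000,000 kilobytes in decimal units.
    kb_per_page = gigabytes * 1_000_000 / pages
    return {"gigabytes": gigabytes, "trees": trees, "kb_per_page": kb_per_page}


scale = archive_scale()
print(scale)
```

On these numbers, 100 terabytes works out to 100,000 gigabytes, or 5 million trees' worth of paper.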

What's in all those pages and how do you reach it? The Archive claims to
contain multiple copies of the entire publicly available web, added to at a
rate of 12 terabytes a month. Some of the data has been grouped into a few
specific collections, which you access by following clear links. (The site is
very clear in design and layout, with easy-to-read fonts as well as self-evident language in all navigational aids.) For everything else, the Archive has a
next-generation search engine called the Wayback Machine (with a jaunty '60s font to match). No dials and knobs here à la Mr. Peabody; this Wayback Machine uses a three-dimensional index that allows surfing over multiple time
periods.

The specific archives include two Library of Congress commissions, covering
the 2000 election and September 11. The election archive features sites from five presidential campaigns and political parties and coverage from CNN, The
Washington Times and Yahoo News, along with a directory of 797 sites broken into 15 categories. (For the 1996 election, there's a snapshot of presidential sites displayed by the Smithsonian.) The September 11 collection preserves national and international news and sites from the government, military and charitable organizations. TV news about the day comprises the first exhibit in the Television Archive.

The U.S. government collection (there's no link for it on the top page;
instead, head for the Wayback Machine and it will appear in the special
collection list) displays hyperlinked thumbnail screenshots of sites with
high traffic or a large number of pages. Among them are the Environmental Protection
Agency, National Park Service, Internal Revenue Service, Securities and Exchange Commission, Centers for Disease Control, Census Bureau, Department of Education, Food & Drug Administration and the House of Representatives. There's an alphabetical listing, too. With several hundred entries per letter, it is not one for browsing.

Mr. Peabody's wayback machine landed with a bit more reliability than
does this site's, but then, his had to transport only two passengers. My random clicking of links in the U.S. government alphabetical listing failed to
turn up a single historical page. Instead, my clicks met with "Sorry, no matches," a warning that service was intermittent, and ultimately the announcement that server capacity had been exceeded and access to the past would be available in the future "(perhaps an hour)."

I encountered similar difficulty reaching featured government sites. My luck
changed with the IRS page, which showed the Archive's research possibilities.
It contained a table showing the number and dates of pages available per year. (That would be seven in 1999, 27 in 2000, and 40 (so far) in 2001.) Alas, when I tried to view one, a data retrieval error appeared.

Until the server capacity increases, I'll be happy to revisit the Archive for
its special collection of Internet pioneers. The first entries afford a nostalgic stroll past such old favorites as the Trojan Room coffee maker (with a photo of the empty burner, as I always saw it), the first incarnation of the Internet Movie Database, Webcrawler (RIP) and Yahoo, vintage December 1996. (It looks so simple, clean, so … uncluttered.)

Other resources include a collection documenting the development of the U.S.
Department of Defense Advanced Research Projects Agency Network (Arpanet); a Usenet archive from 1996-98 and 2000 on, use of which requires submitting a proposal form; and nearly 1000 industrial, social guidance, educational and advertising films from 1902-1973, digitized by the Prelinger Archives. (First-timers beware: indulging in these movies can be habit-forming.) The Archive also sponsors three mailing lists. One is announcement-only with news about Internet libraries, while the others are discussion lists concerning Internet libraries and online film libraries.

The terms of use limit access to scholarship and research for non-commercial, non-infringing or fair use under copyright law. This restriction is clearly enforceable in the collections that require application and approval, such as Usenet, but policing the use of the open collections seems a lost cause. The copyright policy does provide a procedure for authors to report infringement and request that their material be removed. Also, the contact page, though it doesn't draw attention to it, includes a link to instructions on preventing Alexa's robot from crawling a site for inclusion. 
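The column does not reproduce those instructions, but the usual mechanism for turning away a crawler is a robots.txt file at the root of the site. The following is a minimal sketch, assuming Alexa's robot identifies itself with the user-agent token `ia_archiver` (the name it was widely known by, though not stated on the pages reviewed here). The `example.com` address and the `is_allowed` helper are hypothetical. Python's standard `urllib.robotparser` module is used to confirm how such a file would be read.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user-agent token of Alexa's crawler.
# This robots.txt asks that robot, and only that robot, to stay out entirely.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "http://www.example.com/index.html"
alexa_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)
print(alexa_allowed, other_allowed)
```

A file like this only works if the robot chooses to honor it; it is a request, not a lock.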

The privacy policy contains the out-of-the-ordinary warning that some of the
collections are on Unix servers (though it does not specify which) and, as a
result, information sharing is possible between users who have logged on to the same Unix machine. This is akin, the policy explains, to librarians and
visitors seeing each other in a physical library and getting a glimpse of each
other's work. 

The Archive's concept is sound and sorely needed, and the scope of the
collections promises to facilitate -- and preserve the integrity of --
historic government research. Watch this resource. If the server capacity expands, it will be a cornerstone of many a project.

© Kathy Biehl 2001.