Deep Web Research 2008

Bots, Blogs and News Aggregators is a keynote presentation that I have been delivering for the last several years. Much of its material comes from my extensive research into the "invisible" web, or what I like to call the "deep" web. The Deep Web covers roughly 900 billion pages of information, stored throughout the World Wide Web in various files and formats, that current Internet search engines either cannot find or have difficulty accessing. By comparison, search engines currently locate approximately 20 billion pages.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps, and others. Businesses predominantly use these files to communicate information within their organization or to disseminate it to the outside world. Searching for this information with deeper search techniques and the latest algorithms gives researchers access to a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained by searching and accessing the "properties" information stored in these files.
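As a rough illustration of the file-type approach described above, the following Python sketch filters a list of links down to those that point at the document formats named in the paragraph. It is a minimal sketch, not how any particular search engine works: the function names, the extension set, and the example.com URLs are all assumptions made for illustration.

```python
from urllib.parse import urlparse
import posixpath

# File types named in the paragraph above. Treating exactly this set as
# "deep web documents" is an illustrative assumption.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".xls", ".ppt", ".ps"}


def document_extension(url):
    """Return the lowercase extension of the URL path, or "" if there is none."""
    path = urlparse(url).path  # drops the query string and fragment
    return posixpath.splitext(path)[1].lower()


def find_document_links(urls):
    """Keep only the URLs whose path ends in one of the document extensions."""
    return [u for u in urls if document_extension(u) in DOCUMENT_EXTENSIONS]


if __name__ == "__main__":
    # Hypothetical links, as a crawler might collect them from a page.
    sample = [
        "http://example.com/index.html",
        "http://example.com/reports/annual.PDF?year=2007",
        "http://example.com/data/budget.xls",
    ]
    for link in find_document_links(sample):
        print(link)
```

Matching on the URL path alone is only a first pass; a file served without a telltale extension would be missed, and reading the "properties" metadata would require a parser for each format.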

This article and guide is designed to give you the resources you need to better understand the history of deep web research. It also provides classified resources for searching the currently available web, so that you can find the key nuggets of information that only surface once you understand how to search the "deep web".

This Deep Web Research 2008 article is divided into the following sections:


Academic and Scholar Search Engines and Sources

All of OCLC's WorldCat Heading Toward the Open Web by Barbara Quint

An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web by W. Wu, C. Yu, A. Doan, W. Meng

Annotation for the Deep Web

Automatic Extraction of Web Search Interfaces for Interface Schema Integration by H. He, W. Meng, C. Yu, Z. Wu

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery

Automatic Meaning Discovery Using Google by Rudi Cilibrasi and Paul M. B. Vitanyi

Benevolent "Virus" Helps Reveal the Hidden Web

Beyond Google: The Invisible Web - Tools for Teaching the Invisible Web

Bibliomining Bibliography

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works by Dr. Scott Nicholson and

Bot Research

Client-Side Deep Web Data Extraction

Clustering E-Commerce Search Engines by Q. Peng, W. Meng, H. He, C. Yu

Common Information Environment Seeks To Reveal the Hidden Web

Crawling the Hidden Web by Sriram Raghavan and Hector Garcia-Molina

Current Awareness Discovery Tools on the Internet

Data Extraction and Label Assignment for Web Databases

Deep Content - Guide To Effective Searching of the Internet

Deep Web - Exploring the Secrets of the Hidden Internet by Marcus P. Zillman, M.S., A.M.H.A. - 23 minutes - Internet/Technology Channel

Deep Web Navigation in Web Data Extraction

Desperately seeking Web Search 2.0

DigiCULT Thematic Issue 6: Resource Discovery Technologies for the Heritage Sector, June 2004

Diving in the Deep End of the Web by Suzanne Ross

Easy Topic Maps

Efficient and Effective Metasearch Project

Google Teams Up with 17 Colleges to Test Searches of Scholarly Materials By Jeffrey R. Young

Graph Structure in the Web

Gray Literature: Resources for Locating Unpublished Research by Brian S. Mathews

Gray Literature Subject Guide

Indexing Deep Web Content By Paul Bruemmer

Information Foraging and Extraction Techniques for Internet-Based Literature and Data

Information Retrieval and the Semantic Web by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost

In Search of the Deep Web

Invisible Web Gets Deeper

Invisible Web Revealed

IR and IE on the Web - PhD and MSc Dissertations

JEP: The Deep Web

Library Journal: Breaking Through the Invisible Web

LLRX: Book Review: The Invisible Web

LLRX: Competitive Intelligence - A Selective Resource Guide

LLRX: Deep Web Research

LLRX: Deep Web Research 2005

LLRX: Deep Web Research 2006

LLRX: Deep Web Research 2007

LLRX: Mining Deeper Into the Invisible Web

LLRX: ResearchWire: Exposing the Invisible Web

Metadata? Thesauri? Taxonomies? Topic Maps! by Lars Marius Garshol

Mining Newsgroups Using Networks Arising From Social Behavior

Mining the Deep Web: Search Strategies That Work by Lee Ratzan

Mining the Deep Web With Specialized Drills

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews

Mining Topic-Specific Concepts and Definitions on the Web

Modelling and Mining of Network Information Systems Publications

Net Plan Builds in Search by Kimberly Patch

Noisy Channels Models Provide Short Answers to FAQs

Online or Invisible?

OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites

OpenIndex - Creating a Public Internet Index

Out-googling Google: Federated Searching and the Single Search Box

PhysicsWeb: The Physics of the Web

Publications about Web Analysis, Web Search, Citation Indexing, Digital Libraries, Machine Learning, Neural Networks [Steve Lawrence, Google Labs]

QProber: Classifying and Searching "Hidden-Web" Text Databases

Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources

Researcher Retrain Thyself

Researchers Map of the Web

Scientific American: Featured Article: The Semantic Web

Scraping the Web for Implied Data

Search Engine Meeting 2005 Boston, Massachusetts - White Papers and Presentations

Search Engine Meeting 2006 Boston, Massachusetts - White Papers and Presentations

Search Engine Meeting 2007 Boston, Massachusetts - White Papers and Presentations

Search Engine Technology and Digital Libraries

Searching the Deep Web

Searching the Deep Web - Video

Searching the Internet (White Paper, Audio and Video)

Seeing through the 'invisible' Web

SemaForm - Semantic Wrapper Generation for Querying Deep Web Data Sources

Semantic Web Content Accessibility Guidelines for Current Research Information Systems (CRIS) by A. Lopatenko

Smart Search - Advanced Search Engines Link Many Data Sources

Structured Databases on the Web: Observations and Implications

Testbed for Information Extraction from Deep Web

The Deep Web

The Deep Web: Surfacing Hidden Value by Michael K. Bergman

The Future Of News: The Digital Information Librarian

The Hidden Potential of the Web

The Invisible Web by Chris Sherman

The Invisible Web: What it is, Why it exists, How to find it, and Its Inherent Ambiguity

The Invisible Web: Where Search Engines Fear To Go

The Mechanics of Deep Net Meta Search

The Ultimate Guide to the Invisible Web

Topological Measures and Maps Of the Web

Towards Automatic Incorporation of Search Engines Into A Large-Scale Metasearch Engine

Traffic-Based Feedback on the Web by Jonathan Aizen, Daniel Huttenlocher, Jon Kleinberg, and Antal Novak

Travel Industry and Deep Web: Exclusive Interview with Marcus P. Zillman

UMBC - AgentNews

Understanding Metadata

Using the Internet As a Dynamic Resource Tool for Knowledge Discovery

Web Characterization Project Web Data Extractors White Paper Link Compilation

Web Pages Search Engine Based on DNS by Wang Liang, Guo Yi-Ping, and Fang Ming

WebScales: Towards a Highly Scalable Metasearch Engine

What Is the Deep Web? A WhatIs Podcast 15 Minute Interview with Marcus P. Zillman

What is the Invisible Web? A Crawler Perspective by Natalia Arroyo, Laboratorio de Internet

WISE-Cluster: Clustering E-Commerce Search Engines Automatically by Q. Peng, W. Meng, H. He, C. Yu

Yahoo and the Deep Web

ZDNet: I've Discovered the 'invisible Web'--Have You? Here's How!


Digital Libraries-Cross-Database Search: One-Stop Shopping

Search Tools Reports: Searching for Text Information in Databases

The Right Solution: Federated Search Tools by Roy Tennant

UK Web Archiving Consortium


Entrez - The Life Sciences Cross-Database Search Engine

EnergyFiles - Subject Pathways

GPO Access - Search Across Multiple Databases

King County Library System

NLM Gateway Search


The Metasearch Infrastructure Project


Apple - Mac - Sherlock

Blue Angel Technologies

Bright Planet


Cross Database Search Tools Summary

Dieselpoint Java Search and Navigation Software

DbVisualizer - The Universal Database Tool

Dublin Core Metadata Initiative (DCMI)

EEVL Xtra - Cross Database Search

EMC Gold Rush - Database Search Tool


MetaSearch Initiative

mod_oai Project - Getting OAI-PMH For Free


Peter's PolySearch Engines

PBCore - The Public Broadcasting Metadata Dictionary

Registry of Library Knowledge Bases

Search Federal Research and Development

SRU - Search/Retrieve via URL

STINET Multisearch

The Flamenco Search Interface Project

VIAF: The Virtual International Authority File




ALPINE Network - SourceForge: Project

An Efficient Scheme for Query Processing on Peer-to-Peer Networks

Azureus - Java Bittorrent Client


Between Rhizomes and Trees: P2P Information Systems by Bryn Loban



BitTorrent FAQ and Guide

Bit Torrent Official Site and Search Engine

Bitzi - The Free Universal Media Catalog

Blog Torrent


BotSpot®: File-sharing Bots

BTbot BitTorrent Search Engine

BTjunkie - Bittorrent Search Engine

Coral - The Coral P2P Content Distribution Network

Capn's PHP Gnutella Search

Crackle - Stream On

Current P2P Search Implementations - P2P Networks - XDCC Search / File Sharing Portal

Deepnet Explorer - P2P/RSS-ATOM Web Browser

Distributed Search Engines

Distributed Search in P2P Networks

FAROO - P2P Web Search


Free Haven Project

FuzzBox: Tangent Research Artificial Intelligence and Robotics

Gnougat: Fully decentralised file caching from the JXTA Project

GNUnet - GNU Project - Free Software Foundation (FSF)



GRACE - GRid seArch and Categorization Engine

Grid Resources

Grokster3G Source, Distributed Internet Crawler!

Hamachi - Secure Mediated Peer To Peer

HyperCuP - Shaping Up Peer-to-Peer Networks

Ian Clarke's Blog

IM and P2P Threat Center


International Workshop on Peer-to-Peer Knowledge Management (P2PKM)

Internet Movie Database (IMDb)

isoHunt - IRC and Bit Torrent Search Engine

JXTA Project

Kademlia: A Peer-to-peer Information System Based on the XOR Metric

Kazaa Media Desktop


LionShare P2P Project - Legitimate File-Sharing Among Individuals and Educational Institutions

Lphant - The Full P2P Solution


Mercora IM P2P Radio

MoleSter - A Tiny File-Sharing Application


Morpheus: Peer-to-Peer File Sharing Software


MysterNetworks - The Evolution of Peer-to-Peer

Open Directory - File Sharing

Open Directory - MP3 Search Engines

OpenNap: Open Source Napster Server

Oyster - Managing, Searching and Sharing Ontology Metadata in a Peer-to-Peer Network.

P2P and the Future of Private Copying by Peter K. Yu, Michigan State University College of Law

P2PNet - Updated P2P News

P2P News from Tipex

PeerCast P2P Radio

PeerMind - P2P Monitor


Port Knocking

PowerFolder - P2P Whole Folder Synchronization

Project JXTA

Rodi Tiny P2P




Slyck - File Sharing News and Info


Streamload - Share Videos and Photos - Online MP3 Storage and Access

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks

Super Powered Peer To Peer

SwarmStream™ SDK

The Anthill Project

The Pirate Bay - BitTorrent Tracker

The Chord Project

The Freenet Project

The Peer-to-Peer Weblog

The Role of Peer to Peer File Sharing in Law Firm Marketing by Andy Havens


Torrent Finder

Torrent Reactor

Torrent Typhoon (TT)

Tribler - A Social Community That Facilitates Filesharing Through P2P


Tubes - Connect Instantly

Understanding BitTorrent: An Experimental Perspective by Arnaud Legout, Guillaume Urvoy-Keller, and Pietro Michiardi

URLBlaze: URL Sharing Network

Videora - Personal Video Using P2P and RSS


WiPeer - Serverless Peer to Peer Collaboration

WiredReach - Powering the User Centric Web

YaCy - Distributed P2P Based Web Indexing and Anonymous Search Engine

Yahoo! Directory Peer-to-Peer File Sharing

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology

YouServ - A P2P (peer-to-peer) Web Hosting/File Sharing System

Zebra


From Theory To Practice - Bielefeld Academic Search Engine

Gumshoe Librarian

Quick Introduction to OWL Web Ontology Language

Searching the Deep Web - Dudley Knox Library Internet Guides - PowerPoint Slides

Searching the Internet


Deep Web Research

A Roadmap for Web Mining: From Web to Semantic Web



Bot Research

BrainBoost - Question Answering Search Engine

BrightPlanet's Deep Federation Portal™ (DFP)

CompletePlanet - 70,000 Databases and Speciality Search Engines

Creative Commons RDF-Enhanced Search

Cyber Cemetery


Cybermetrics - First Generation Tools - Invisible Web

Data Fountains: Open Source Internet Resource Discovery and Metadata/Full-Text Generation Service

Data Mining Resources

Deep Web Research

Deep Web Search

Deep Web Technologies

DigiCULT Resources - Resource Discovery & Information Retrieval


EEVL's Ejournal Search Engines

eHealthcare Bot Deep Meta Search Engine

eMarketing Bot Deep Meta Search Engine


Engineering Village 2

Hakia - Search For Meaning

Find Articles

Freely Accessible Databases for the Public

Ghostscript, Ghostview and GSview

GlobalSpec - Engineering Search Engine

Google Labs

Google Scholar

HighWire Press - Largest Repository of Free Full-Text Life Science Articles in the World


IncyWincy - The Invisible Web Search Engine


Instant Information Systems

Institutional Archives Registry

Intelligence Center


Internet Archive


Invisible Library

Kapow Web Collector

KDnuggets: Data Mining, Web Mining, and Knowledge Discovery Guide


Knowledge Discovery

Large-Scale Deep Web Integration: Incomplete Bibliography

Librarians' Index to the Internet


Mamma - Deep Web Health Search Engine

Mappa.Mundi Magazine

Medical Databases Online

Microsoft Web Search Research and Patents

Mining the Deep Web for Economic Data

Mooter Search

MSN Sandbox

News Group Search

New Zealand Digital Library

OAI-PMH Implementation Guidelines - Conveying rights expressions about metadata in the OAI-PMH framework


OneLook Dictionary Search

Open Archives Initiative

OpenIndex - Creating a Public Internet Index

Open WorldCat-enabled Web Tools

QProber: Classifying and Searching "Hidden-Web" Text Databases - PERSIVAL Project

Quigo Technologies

Powerset - Natural Language Semantic Based Web Search Engine

Pretrieve Search - Free Public Record Search Engine

Recommended Gateway Sites for the Deep Web


Resource Discovery Network

Science and Technology Sources on the Internet

Scientific and Technical Information Network (STINET)

Science Commons - FirstGov for Science - Government Science Portal

Scirus - Search Engine for Scientific Information

SDARTS - A Protocol and Toolkit for Metasearching

Search Adobe PDF Online

STN International - Databases in Science and Technology

Swoogle - Semantic Bot

TechXtra - Indepth Academic and Scholar Search

Testbed for Information Extraction from Deep Web

The Internet Sleuth

The Deep Web

The Invisible Web

THOR: Deep Web Data Extraction

Those Dark Hiding Places: The Invisible Web Revealed


UNESCO Information Services - Databases

Wall Street Executive Library

Web Data Extractors

Web Farming

WebFountain™

Web Intelligence Consortium

Web IR & IE

WebScales: Towards a Highly Scalable Metasearch Engine

Web-Searching Agents


AIS SIGSEMIS - SIGSEMIS: Semantic Web and Information Systems

Analyzing Social Networks on the Semantic Web


Combining RDF and OWL with SOAP for Semantic Web

DARPA Agent Markup Language

DBin Project - Semantic Web P2P and/or Semantic Newsgroup Client

DERI International - Digital Enterprise Research Institute

Digital Object Identifier (DOI)

Fabl - A Native Programming Language for the Semantic Web

Foundation for Intelligent Physical Agents (FIPA)

The FOAF Project - A Semantic Web Application

hakia - Search for Meaning

HP Labs Semantic Web Research

Infomesh's Semantic Web Introduction

International Journal of Metadata, Semantics and Ontologies (IJMSO)

Jena - A Semantic Web Framework for Java

Journal of Web Semantics

Journal of Web Semantics: Preprint Server

Knowledge Discovery


Knowledge Search

Language Engineering for the Semantic Web: A Digital Library for Endangered Languages

Magpie - The Semantic Filter and Tool For the Semantic Web

MetaData at W3C

Metadata FAQ - Metadata for Education

MindRaider - Semantic Web Outliner



OASIS - Advancing eBusiness Standards

OIL - Ontology Inference Layer

Ontologies for Education (O4E)

Ontology Matching

Ontology Metadata Vocabulary (OMV)


O'Reilly's Semantic Web Primer

Potential Advantages Of Semantic Web For Internet Commerce by Yuxiao Zhao and Kristian Sandahl

Powerset - Natural Language Semantic Based Web Search Engine

pOWL - Semantic Web Development Plattform

Practical Semantic Analysis of Web Sites and Documents

RDF Context Tools

RDF - Resource Description Framework

Rules and Rule Markup Languages for the Semantic Web - RuleML-2003

Science and the Semantic Web

Semantic Blogging: Spreading the Semantic Web Meme

Semantic Desktop Environment - gnowsis

Semantic Email by Luke McDowell, Oren Etzioni, Alon Halevy, and Henry Levy

Semantic Indexing

Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE)

Semantic Knowledge Technologies and Language Computation

Semantic Markup Deconstructed Example

Semantic Planet Weblog

Semantic Routing BOF

Semantic Translator for Enhanced Retrieval by the Bremen University (BUSTER) - The Semantic Web Community Portal

Semantic Web Activity Statement

Semantic Web Application Platform - SWAP

Semantic Web for AURIS-MM

Semantic Web Laboratory

Semantic Web Primer for Object-Oriented Software Developers

Semantic Web Publications

Semantic Web Roadmap

Semantic Web Services Challenge

Semantic Web W3C

SemText - Semantic Hypertext - Making Latent Semantics Blatant

SIG SEMIS Semantic Web and Information Systems

SIMAC - Foafing the Music - Semantic Interaction with Music Audio Contents

SIMILE Project - Semantic Interoperability of Metadata and Information in unLike Environments

SOAPAgent - An Open SOAP Directory Project Info - OWL API

Swoogle - Semantic Bot

SWRL: A Semantic Web Rule Language Combining OWL and RuleML

Technology Review: Sir Tim Berners-Lee - The Semantic Web

The Cover Pages

The Memetic Web

The ontoprise® GmbH

The RDF Query Language (RQL)

The Semantic Grid

The Semantic Social Network by Stephen Downes

The Semantic Web: An Introduction

The Semantic Web By Tim Berners-Lee, James Hendler and Ora Lassila

The Semantic Web In Breadth

The Semantic Indexing Project - Creating Tools To Identify the Latent Knowledge Found in Text

The Semantic Web Is Your Friend

Transforming and Enriching Documents for the Semantic Web by Dietmar Roesner, Manuela Kunze, Sylke Kroetzsch

Twine - A Semantic Web Application That Allows You To Share, Organize, and Find Information

UDDI - Universal Description, Discovery, and Integration

Web Semantics: Science, Services and Agents on the World Wide Web

Web Service Modeling Ontology

Wilbur Toolkit for Semantic Web Programming

WonderWeb World Wide Web Reference Semantic Web

Yahoo Groups - SemanticWeb


1st Spot

Agent-Based Software Development

Agent Construction Tools



Agent Model Yields Leadership

Agent Portal AI


Agents Portal

Alarm Growing Over Bot Software by Robert Lemos


Android World

Applied Soft Computing

B.4.1 Search Robots - The Robots.txt File

Bot A Blog


Bots, Blogs and News Aggregators


BrowseEngine - Real-Time Meta-Data Search Engine

Build a Web Spider on Linux - A Simple Spider and Scraper Collects Internet Content

Cetus Links - Mobile Agents


Data Mining Resources

DataparkSearch Engine - Full-Featured Open Source Web-Based Search Engine


Deep Web Research

Design of a Parallel and Distributed Web Search Engine by Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Dictionary of Algorithms and Data Structures

Eliza - The Original ChatterBot

FAME (Facilitating Agents in Multiculture Exchange) Project

Fantomas Spider Spy™ The BotBase

Foundation for Intelligent Physical Agents


GeneSys Middleware

Google Guide

Indexing Robot Crawler Checklist

Institute for Human and Machine Cognition (IHMC)

Intellexer - Custom Built Search Engines, Knowledge Management Tools, Natural Language Processing

Internet Agents - CWS Apps

Internet Mathematics


Knowledge Discovery

Koders - Source Code Search Engine

LAIR - Research Projects of the Laboratory of Applied Informatics Research

List of User-Agents (Spiders, Robots, Crawler, Browser)

Minimal-Intelligence Agents for Bargaining Behaviors in Market-Based Environments by Dave Cliff and Janet Bruten

MIT Media Lab: Software Agents

Modelling and Mining of Network Information Systems



OpenKapow - Serving Mashups For the Long Tail of the Web

Open Source Web Information Retrieval (OSWIR05)

Oxyus Search Engine - Web Spider and Search Engine

Robots, Spiders and Other User Agents: A Resource for WebMasters

Robots.Txt Checker - Validator for Robots.txt Files

Searchbots - Uniquely Searching the Internet

Search Engine Robots

Search Engine Watch News

Search Tools - Information Guides and News

Semantic Indexing and Search and

Semantic Web


Smarter Bots


Spider Hunter

Spidering Hacks

Structure and Interpretation of Computer Programs - Video Lectures by Hal Abelson and Gerald Jay Sussman

Supybot, A Superb Python IRC Bot

Swoogle - Semantic Bot

The Intelligent Software Agents Lab

The Search Engine Project (TSEP)

The Simon Lavern Page

The Web Robots Pages

UMBC AgentWeb

UMBC eBiquity

Webbot - the W3C libwww Robot

Web Curator Tool (WCT)

Web Data Extractors - White Paper Link Compilation

Web Intelligence Consortium

Web IR & IE

Subject Tracer™ Information Blogs

Subject Tracer™ Information Blogs, created and developed by the Virtual Private Library™, combine the best of the latest tools on the Internet. Using bots, blogs and news aggregators, the Subject Tracer™ Information Blogs generate RSS feeds of the latest resources, creating a current flow of information through niche subject tracers. I am proud to be the creator of the Internet’s first Subject Tracer™ Information Blogs:

Virtual Private Library

Accessibility Resources

Agriculture Resources

Artificial Intelligence Resources

Astronomy Resources

Auction Resources

Biological Informatics

Biotechnology Resources

Bot Research

Business Intelligence Resources


Data Mining Resources

Deep Web Research

Directory Resources

eCommerce Resources

Elder Resources

Employment Resources

Entrepreneurial Resources

Financial Sources

Finding People

Games Resources

Genealogy Resources

Grant Resources

Grid Resources

Healthcare Resources

Information Futures Markets

Information Quality Resources

Internet Alerts

Internet Demographics

Internet Experts

Internet Hoaxes

Journalism Resources

Knowledge Discovery

Military Resources

Outsourcing/Offshoring Information and Resources

Privacy Resources

Reference Resources

Research Resources


Script Resources


Social Informatics

Statistics Resources

Student Research

Theology Resources

Tutorial Resources

World Wide Web Reference
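
The Subject Tracer™ description above mentions generating RSS feeds of the latest resources. As a hedged illustration of what such a feed looks like, the Python sketch below builds a minimal RSS 2.0 document from a list of (title, link) pairs. This is not the Subject Tracer™ implementation, which the article does not describe; the function name, the feed details, and the example.com URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 feed and return it as a string.

    `items` is a list of (item_title, item_link) pairs, newest first.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel elements.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical entries that a bot might have collected for one subject.
    feed = build_rss(
        "Deep Web Research",
        "http://example.com/deepweb",
        "Latest deep web research resources",
        [("The Deep Web: Surfacing Hidden Value", "http://example.com/item1")],
    )
    print(feed)
```

A news aggregator subscribed to a feed like this would pick up each new item as it is added, which is the resource flow the paragraph describes.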