Deep Web Research and Discovery Resources 2014

Bots, Blogs and News Aggregators is a keynote presentation that I have been delivering over the last several years, and much of my information comes from the extensive research I have completed during that time into the “invisible” web, or what I like to call the “deep” web. The Deep Web covers somewhere in the vicinity of trillions upon trillions of pages of information, located throughout the World Wide Web in various files and formats, that current search engines either cannot find or have difficulty accessing. By comparison, current search engines find hundreds of billions of pages as of this writing.

In the last several years, some of the more comprehensive search engines have written algorithms to search the deeper portions of the World Wide Web by attempting to find files such as .pdf, .doc, .xls, .ppt, .ps, and others. These files are predominantly used by businesses to communicate information within their organization or to disseminate information to the outside world. Searching for this information using deeper search techniques and the latest algorithms allows researchers to obtain a vast amount of corporate information that was previously unavailable or inaccessible. Research has also shown that even deeper information can be obtained by searching and accessing the “properties” information stored in these files!
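As a rough illustration of searching by file type, the sketch below builds one query string per file extension named above. It assumes the `filetype:` and `site:` search operators, which several major search engines accept but which this guide does not itself name; the exact syntax varies by engine, so check the engine's own help pages. The function names and the `example.com` domain are hypothetical and used only for illustration.

```python
from urllib.parse import quote_plus

# File extensions named in the paragraph above; extend the list as needed.
DOCUMENT_EXTENSIONS = ["pdf", "doc", "xls", "ppt", "ps"]


def build_filetype_queries(phrase, extensions=DOCUMENT_EXTENSIONS, site=None):
    """Return one query string per extension, e.g. '"annual report" filetype:pdf'.

    Assumes the engine understands the filetype: and site: operators.
    """
    queries = []
    for ext in extensions:
        # Quote the phrase for exact matching and drop any leading dot.
        parts = [f'"{phrase}"', f"filetype:{ext.lstrip('.')}"]
        if site:
            parts.append(f"site:{site}")
        queries.append(" ".join(parts))
    return queries


def to_url_parameter(query):
    """Percent-encode a query so it can be used as a URL query-string value."""
    return quote_plus(query)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real target.
    for q in build_filetype_queries("annual report", site="example.com"):
        print(q, "->", to_url_parameter(q))
```

Running one query per file type, rather than a single combined query, keeps each result set small enough to review by hand, which suits the kind of targeted corporate research described above.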

This report and guide is designed to give you the resources you need to better understand the history of deep web research, as well as various categorized resources that allow you to search the currently available web for those key nuggets of information found only by understanding how to search the “deep web”.

This Deep Web Research and Discovery Resources 2014 report and guide is divided into the following sections:

Articles, Papers, Forums, Audios and Videos
Cross Database Articles
Cross Database Search Services
Cross Database Search Tools
Peer to Peer, File Sharing, Grid/Matrix Search Engines
Resources – Deep Web Research
Resources – Semantic Web Research
Bot and Intelligent Agent Research Resources and Sites

Articles, Papers, Forums, Audios and Videos

99 Resources to Research & Mine the Invisible Web by Jessica Hupp

Academic and Scholar Search Engines and Sources

All of OCLC’s WorldCat Heading Toward the Open Web by Barbara Quint

An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web by W. Wu, C. Yu, A. Doan, W. Meng

Annotation for the Deep Web

Automatic Extraction of Web Search Interfaces for Interface Schema Integration by H. He, W. Meng, C. Yu, Z. Wu

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery

Automatic Meaning Discovery Using Google by Rudi Cilibrasi and Paul M. B. Vitanyi

Beyond Google: The Invisible Web – Tools for Teaching the Invisible Web

Bibliomining Bibliography (Outdated)

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works by Dr. Scott Nicholson

Bot Research

Client-Side Deep Web Data Extraction

Clustering E-Commerce Search Engines by Q. Peng, W. Meng, H. He, C. Yu

Creating Intelligence from Big Data

Current Awareness Discovery Tools on the Internet

Data Extraction and Label Assignment for Web Databases

Deep Web – Exploring the Secrets of the Hidden Internet by Marcus P. Zillman, M.S., A.M.H.A. – 23 minutes – Internet/Technology Channel

Desperately Seeking Web Search 2.0

Digging Deeper into Deep Web Databases by Breaking Through the Top-k Barrier

DigiCULT Thematic Issue 6: Resource Discovery Technologies for the Heritage Sector, June 2004

Effective and Scalable Metasearch Project

Efficient Deep Web Crawling Using Reinforcement Learning

Experiences In Crawling Deep Web In The Context Of Local Search

Grey Literature

Grey Literature Network Service (GreyNet)

Information Retrieval and the Semantic Web by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost

In Search of the Deep Web

Invisible Web Gets Deeper

Invisible Web Revealed

IR and IE on the Web – PhD and MSc Dissertations

LLRX: Book Review: The Invisible Web

LLRX: Deep Web Research

LLRX: Deep Web Research 2005

LLRX: Deep Web Research 2006

LLRX: Deep Web Research 2007

LLRX: Deep Web Research 2008

LLRX: Deep Web Research 2009

LLRX: Deep Web Research 2010

LLRX: Deep Web Research 2011

LLRX: Deep Web Research 2012

LLRX: Deep Web Research 2013

LLRX: Deep Web Research 2014

LLRX: Mining Deeper Into the Invisible Web

LLRX: ResearchWire: Exposing the Invisible Web

Metadata? Thesauri? Taxonomies? Topic Maps! by Lars Marius Garshol

Mining Newsgroups Using Networks Arising From Social Behavior

Mining the Deep Web: Search Strategies That Work by Lee Ratzan

Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews

Mining Topic-Specific Concepts and Definitions on the Web

Net Plan Builds in Search by Kimberly Patch

Online or Invisible? [Requires Login]

OntoMiner: Bootstrapping and Populating Ontologies From Domain Specific Web Sites

OpenIndex – Creating a Public Internet Index

Out-googling Google: Federated Searching and the Single Search Box

Publications about Web Analysis, Web Search, Citation Indexing, Digital Libraries, Machine Learning, Neural Networks [Steve Lawrence, Google Labs]

QProber: Classifying and Searching “Hidden-Web” Text Databases

Research Beyond Google: 119 Authoritative, Invisible, and Comprehensive Resources

Scientific American: Featured Article: The Semantic Web

Search Engine Meeting

Search Engine Technology and Digital Libraries

Searching the Deep Web by Alex Wright

Searching the Deep Web

Searching the Deep Web – Video

Searching the Internet (White Paper, Audio and Video)

Search Interfaces on the Web: Querying and Characterizing by Denis Shestakov

Seeing through the ‘invisible’ Web

Semantic Web Content Accessibility Guidelines for Current Research Information Systems (CRIS) by A. Lopatenko

Structured Databases on the Web: Observations and Implications

Testbed for Information Extraction from Deep Web

The Deep Web: Semantic Search

The Deep Web: Surfacing Hidden Value by Michael K. Bergman

The Future Of News: The Digital Information Librarian

The Hidden Potential of the Web

The Invisible Web by Chris Sherman

The Invisible Web: What it is, Why it exists, How to find it, and Its Inherent Ambiguity

The Invisible Web: Where Search Engines Fear To Go

The Ultimate Guide to the Invisible Web

The Virtual Private Library(TM) and The Deep Web Video by Melissa Barker

Timeline of Events Related to the Deep Web

Topological Measures and Maps Of the Web

Toward the Semantic Deep Web by James Geller, Soon Ae Chun, and Yoo Jung An

Towards Automatic Incorporation of Search Engines Into A Large-Scale Metasearch Engine

Traffic-Based Feedback on the Web by Jonathan Aizen, Daniel Huttenlocher, Jon Kleinberg, and Antal Novak

Travel Industry and Deep Web: Exclusive Interview with Marcus P. Zillman

UMBC – AgentNews

Understanding Metadata

Understanding the Deep Web In 10 Minutes

Using the Internet As a Dynamic Resource Tool for Knowledge Discovery

Web Characterization Activity

Web Data Extractors White Paper Link Compilation

Web Pages Search Engine Based on DNS by Wang Liang, Guo Yi-Ping, and Fang Ming

WebScales: Towards a Highly Scalable Metasearch Engine

What Is the Deep Web? A WhatIs Podcast 15 Minute Interview with Marcus P. Zillman

What is the Invisible Web? A Crawler Perspective by Natalia Arroyo, Laboratorio de Internet

Wikipedia – Deep Web

WISE-Cluster: Clustering E-Commerce Search Engines Automatically by Q. Peng, W. Meng, H. He, C. Yu

Cross Database Articles

Search Tools Reports: Searching for Text Information in Databases

The Right Solution: Federated Search Tools by Roy Tennant

UK Web Archiving Consortium

Cross Database Search Services

EnergyFiles – Subject Pathways [Oil Gas production and forecasting]

FDsys – Search Across Multiple Government Databases

King County Library System

NLM Gateway Search

SUMSearch 2 [Health Sciences]

Scirus – Scientific Information Only [Retired in early 2014]

The Metasearch Infrastructure Project

Cross Database Search Tools

Bright Planet – Deep Web Intelligence


Dieselpoint Java Search and Navigation Software

DTIC Multisearch – Information for the Defense Community

Dublin Core Metadata Initiative (DCMI)

EEVL Xtra – Cross Database Search

Gold Rush – Database Search Tool


MetaSearch Initiative


Peter’s PolySearch Engines

PBCore – The Public Broadcasting Metadata Dictionary

Registry of Library Knowledge Bases

Search Federal Research and Development

SRU – Search/Retrieve via URL

The Flamenco Search Interface Project

VIAF: The Virtual International Authority File

Peer to Peer, File Sharing, Grid/Matrix Search Engines

ALPINE Network – SourceForge: Project

Azureus – Vuze Java BitTorrent Client

BadBlue [Uncensored News]

Between Rhizomes and Trees: P2P Information Systems by Bryn Loban


Bitmessage – P2P Communication Protocol To Send Encrypted Messages

BitTorrent Official Site and Search Engine

Coral – The Coral P2P Content Distribution Network

Capn’s PHP Gnutella Search [Only code is available for download]

ClearBits – BitTorrent distribution of open licensed media

Deepnet Explorer – Web Browser

Distributed Search Engines

Distributed Search in P2P Networks

DirecTransFile – P2P File Transfers

FAROO – P2P Web Search

FilesOverMiles – Browser to Browser File Sharing (P2P)

Filetopia – File sharing tool with public key encryption

Free Haven Project

Frost Project – Freenet Messaging and File Sharing Client

FuzzBox: Tangent Research Artificial Intelligence and Robotics

GNUnet – Secure P2P Networking – Free Software Foundation (FSF)

Grid, Distributed and Cloud Computing Resources

GNU GRUB – Multiboot Boot Loader

Ian Clarke’s Blog

iMesh [Free Legal Music]

International Workshop on Peer-to-Peer Knowledge Management (P2PKM)

Internet Movie Database (IMDb)

Kademlia: A Peer-to-peer Information System Based on the XOR Metric [Citeseer Login Required]

Lphant – The Full P2P Solution

MoleSter – A Tiny File-Sharing Application

MusicBrainZ – Open Music Encyclopedia

MysterNetworks – The Evolution of Peer-to-Peer

Open Directory – File Sharing

Open Directory – MP3 Search Engines

OpenNap: Open Source Napster Server

P2P and the Future of Private Copying by Peter K. Yu, Michigan State University College of Law

Peer-To-Peer Wikipedia

Peer to Peer File Sharing – P2P Networking


Port Knocking

PowerFolder – P2P Whole Folder Synchronization

Rodi – Tiny P2P Client/Host



Slyck – File Sharing News and Info

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks [CiteSeer Login Required]

Swarm – A Transparently Scalable Distributed Programming Language

The Anthill Project

The Pirate Bay – BitTorrent Tracker

The Freenet Project

The Peer-to-Peer Weblog [Last updated 2010]

The Role of Peer to Peer File Sharing in Law Firm Marketing by Andy Havens


Torrent Reactor

Tribler – A Social Community That Facilitates Filesharing Through P2P


Understanding BitTorrent: An Experimental Perspective by Arnaud Legout, Guillaume Urvoy-Keller, and Pietro Michiardi

WASTE (Secure P2P communication)

YaCy – Distributed P2P Based Web Indexing and Anonymous Search Engine

Yahoo! Directory Peer-to-Peer File Sharing

YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology [CiteSeer Login Required]

YouServ – A P2P (peer-to-peer) Web Hosting/File Sharing System

Zebra – Structured Text Indexing and Retrieval

Zilok – Peer To Peer Rental Marketplace


Deep Web

Deep Web and Darknet – What Lies Beyond the Surface of the World Wide Web – The Colin McEnroe Show On WNPR

From Theory To Practice – Bielefeld Academic Search Engine

Gumshoe Librarian

Searching the Internet Whitepaper

The Virtual Private Library(TM) and The Deep Web Video by Melissa Barker

RESOURCES – Deep Web Research

AEON (Automatic Evaluation of ONtologies)

AnkaSearch – Meta Search and Deep Web Search Desktop Tool

A Roadmap for Web Mining: From Web to Semantic Web

AskReddit – What Are Your Experiences With the Deep Web

Biznar – Deep Federated Search

Bot Research

BrightPlanet – Deep Web Intelligence

Catalog of U.S. Government Publications (CGP)

Cazoodle – Search, Integrate, and Organize — The Real World

CompletePlanet – 70,000 Databases and Speciality Search Engines

Creative Commons RDF-Enhanced Search

Cyber Cemetery

Cybermetrics – First Generation Tools – Invisible Web

Data Mining Resources

Deep Web Research Resources

Deep Web Search

Deep Web Technologies – federated search

Directory Resources

eFinancial Bot Deep Meta Search Engine

eGreenBot – Green Resources Search Engine

eHealthcare Bot Deep Meta Search Engine

eMarketing Bot Deep Meta Search Engine


Engineering Village

Falcons Semantic Web Search Engine

Federated Search Blog

Freely Accessible Databases for the Public

Google Fusion Tables

Google Scholar

HighWire Press – Largest Repository of Free Full-Text Life Science Articles in the World


Internet Archive

Invisible Library

Kapow Web Collector

Karma – Data Integration Tool

KDnuggets: Data Mining, Web Mining, and Knowledge Discovery Guide

Knowledge Discovery

Large-Scale Deep Web Integration: Incomplete Bibliography

Linked Data – Connect Distributed Data Across the Web

LinkingOpenData – W3C SWEO Community Project


Mappa.Mundi Magazine

Mednar – Innovative Medical Search

Microsoft Bing Web Search Research and Patents

Mining the Deep Web for Economic Data

New Zealand Digital Library

OAI-PMH Implementation Guidelines – Conveying rights expressions about metadata in the OAI-PMH framework


OECD.StatExtracts – Complete Databases Available Via OECD’s iLibrary

OneLook Dictionary Search

Open Archives Initiative

OpenIndex – Creating a Public Internet Index

Open Source Intelligence

QProber: Classifying and Searching “Hidden-Web” Text Databases – PERSIVAL Project

Recommended Gateway Sites for the Deep Web

ReportLinker: Industry Reports, Company Profiles and Market Statistics

SAO/NASA Astrophysics Data System (ADS)

Science Accelerator – Search Key Resources from DOE OSTI


Science and Technology Sources on the Internet

Scientific and Technical Information Network (STINET)

Science Commons – FirstGov for Science – Government Science Portal – Deep Web Search Engine

Scirus – Search Engine for Scientific Information [Closed early 2014]

SciTech Connect

SDARTS – A Protocol and Toolkit for Metasearching

SIMILE Widgets – Free, Open-Source Data Visualization Web Widgets and More

Social Buzz Bot (PDF download)

STN International – Databases in Science and Technology

Swoogle – Semantic Bot

SWRC Ontology

TechDeepWeb – How-To Guide to the Deep Web for IT Professionals

TechXtra – Indepth Academic and Scholar Search

Testbed for Information Extraction from Deep Web

The Invisible Web

The World Bank – Data

THOR: Deep Web Data Extraction

Tor Browser Bundle – Anonymity

TRID – The TRIS and ITRD Database (Transportation Research Board)

Twitter/Search #deepweb

UNdata – Data Access System To UN Databases

UNESCO Information Services – Databases

Useful Tips and Tools to Research the Deep Web

Wall Street Executive Library

Web Data Extractors

Web Farming

WebFountain(TM) – Analytical engine unstructured data

Web IR & IE

WebScales: Towards a Highly Scalable Metasearch Engine

WTO Statistics Database

Zaba Search – Free People Search and Public Information Search Engine

RESOURCES – Semantic Web Research

4Store – An Efficient, Scalable and Stable RDF Database

Analyzing Social Networks on the Semantic Web

DARPA Agent Markup Language

DBin Project – Semantic Web P2P and/or Semantic Newsgroup Client.

Digital Object Identifier (DOI)

Falcons Semantic Web Search Engine

FOAF Project – A Semantic Web Application

Foundation for Intelligent Physical Agents (FIPA)

GistWeb – Gist of Any Web Page Actual Content

Go3R – Knowledge Based Semantic Search Engine To Avoid Animal Experiments

GoodRelations Vocabulary – Semantic Web Based eCommerce

Infomesh’s Semantic Web Introduction

International Journal of Metadata, Semantics and Ontologies (IJMSO)

International Journal on Semantic Web and Information Systems (IJSWIS)

Jena – A Semantic Web Framework for Java

Journal of Biomedical Semantics

Journal of Web Semantics

Journal of Web Semantics: Preprint Server

Knowledge Discovery


Language Engineering for the Semantic Web: A Digital Library for Endangered Languages

Linked Open Data from the New York Times

Magpie – The Semantic Filter and Tool For the Semantic Web

MetaData at W3C

MindRaider – Semantic Web Outliner

OASIS – Advancing eBusiness Standards

Ontology Matching

Ontology Metadata Vocabulary (OMV)

O’Reilly’s Semantic Web Primer

Potential Advantages Of Semantic Web For Internet Commerce by Yuxiao Zhao and Kristian Sandahl [CiteSeer Login Required]

pOWL – Semantic Web Development Plattform

Practical Semantic Analysis of Web Sites and Documents [CiteSeer Login Required]

RDF Context Tools

RDF – Resource Description Framework

Rules and Rule Markup Languages for the Semantic Web – RuleML-2003 – Interlinking the Web of Data

SAO/NASA Astrophysics Data System (ADS)

Semantic Knowledge Technologies and Language Computation – The Semantic Web Community Portal

Semantic Web Activity Statement

Semantic Web Application Platform – SWAP

Semantic Web for AURIS-MM

Semantic Web Primer for Object-Oriented Software Developers

Semantic Web Roadmap

Semantic Web Search Engine (SWSE)

Semantic Web Services Challenge

Semantic Web – The Voice of Semantic Web Technology

Semantic Web W3C

SenseBot – Semantic Search Engine That Finds Sense On the Web

Simile Widgets – Free, Open-Source Data Visualization Web Widgets and More

Sindice – The Semantic Web Index Project Info – OWL API

Swoogle – Semantic Bot

SWRL: A Semantic Web Rule Language Combining OWL and RuleML

The Authoritative Resource List for the Semantic Web by Kaila Strong

The Cover Pages

The RDF Query Language (RQL)

The Semantic Web: An Introduction

The Semantic Web By Tim Berners-Lee, James Hendler and Ora Lassila

The Semantic Web In Breadth

The Semantic Web Is Your Friend

Transforming and Enriching Documents for the Semantic Web by Dietmar Roesner, Manuela Kunze, Sylke Kroetzsch

uClassify – Free Text Classifier Web Service

Watson Web – Exploring the Semantic Web

Web Semantics: Science, Services and Agents on the World Wide Web

Web Service Modeling Ontology

Wilbur Toolkit for Semantic Web Programming [Project no longer actively maintained]

World Wide Web Reference Semantic Web

Yahoo Groups – SemanticWeb

Bot and Intelligent Agent Research Resources and Sites

1st Spot

80legs – Powerful and Economical Service Platform for Crawling and Processing Web Content

Agent Construction Tools


Agent Model Yields Leadership [2004 article]


AgentSheets – Authoring Tool to Create Agents


Applied Soft Computing

Article Search API – New York Times Articles 1981 to Present

Bots, Blogs and News Aggregators


cQuery – Content Query Engine

Data Mining Resources

DataparkSearch Engine – Full-Featured Open Source Web-Based Search Engine

Deep Web Research (PDF download)

Design of a Parallel and Distributed Web Search Engine by Salvatore Orlando, Raffaele Perego, and Fabrizio Silvestri

Dictionary of Algorithms and Data Structures

Eliza – The Original ChatterBot

FAME (Facilitating Agents in Multiculture Exchange) Project

File Information Tool Set (FITS)

Foundation for Intelligent Physical Agents

Google Guide

Imagination Engines

Indexing Robot Crawler Checklist

Information Retrieval Intelligence

Institute for Human and Machine Cognition (IHMC)

Intellexer – Custom Built Search Engines, Knowledge Management Tools, Natural Language Processing

Intelligent Information Systems Research Laboratory

International Journal of Agent-Oriented Software Engineering (IJAOSE)

Knowledge Discovery

Koders – Source Code Search Engine

LAIR – Laboratory of Applied Informatics Research

List of User-Agents (Spiders, Robots, Crawler, Browser)

Minimal-Intelligence Agents for Bargaining Behaviors in Market-Based Environments by Dave Cliff and Janet Bruten

MIT Media Lab: Software Agents

Modelling and Mining of Network Information Systems

Mozenda Web Agent Builder – Web Data Extraction


MySpiders [CiteSeer Login Required]

Open Source Web Information Retrieval (OSWIR05)

Oxyus Open Source Search Engine

Searchbots – Uniquely Searching the Internet

Search Engine Robots

Search Engine Watch News

Search Tools – Information Guides and News

SeerSuite – CiteSeerX Toolkit

Semantic Web


Siri – Your Virtual Personal Assistant

Smarter Bots

SocialBuzzBot – The Business and Social Intelligence Search Engine for Information Discovery from Social Communities

SocSciBot – Social Sciences Link Analysis Research

Spidering Hacks

Spinn3r: RSS Content, News Feeds, News Content, News Crawler and Web Crawler APIs

Structure and Interpretation of Computer Programs – Video Lectures by Hal Abelson and Gerald Jay Sussman

Supybot, A Superb Python IRC Bot

Swoogle – Semantic Bot

TextRunner Search – Searches Hundreds of Millions of Assertions Extracted from 500 Million High-Quality Web Pages

The Intelligent Software Agents Lab

The Lemur Toolkit – Language Modeling and Information Retrieval Research

The Search Engine Project (TSEP)

The Simon Lavern Page

TSEP – The Search Engine Project

UMBC AgentWeb

UMBC eBiquity

Web Curator Tool (WCT)

Web Data Extractors – White Paper Link Compilation

Web Intelligence Consortium

Web IR & IE

WolframAlpha Computational Knowledge Engine – Trillions of Pieces of Curated Data and Millions of Lines of Algorithms
