Web Data Extractors 2016

Extracting data from the World Wide Web (WWW) has become an important issue in the last few years, as the number of web pages available on the visible Internet has grown to over 20 billion, with over 1 trillion pages available from the invisible web. Tools and protocols for extracting all this information are now in demand, as researchers as well as web browsers and surfers want to discover new knowledge at an ever-increasing rate! As robots (bots) and intelligent agents are at the heart of many extraction tools, I decided to create a compilation of the latest sources and sites that extract information from the web. A number of eMail extraction tools are still available through the Internet, but I have decided not to list these, as they add to the ongoing and increasing problem of spam, except for a readily available DMOZ Directory listing.

Web Data Extractors:

80legs – Powerful and Economical Service Platform for Crawling and Processing Web Content
http://www.80legs.com/

Anthracite
http://freecode.com/projects/anthracite

Aristo – Answer Questions with a Knowledgeable Machine
http://allenai.org/aristo.html

artoo.js – The Client-Side Scraping Companion
http://medialab.github.io/artoo/

Automated RSS Scraper Scripts
http://www.djeaux.com/rss/

Automated Information Solutions
http://www.automated-info-solutions.com/

Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery
http://portal.acm.org/citation.cfm?id=640423&dl=ACM&coll=portal

Automation Anywhere – Web Data Extraction Software
http://www.automationanywhere.com/solutions/webDataExt.htm

Beautiful Soup
http://freecode.com/projects/beautifulsoup

Beautiful Soup – HTML/XML Parser for Quick Turnaround Screen Scraping and Web Data Extraction
http://www.crummy.com/software/BeautifulSoup/

BLIASoft Knowledge Discovery
http://www.bliasoft.com/Eindex.html

Bot Research
http://www.BotResearch.info/

BYU Data Extraction Research Group
http://www.deg.byu.edu/

Captiva Software: Digital Information Capture Software
http://www.emc.com/enterprise-content-management/captiva/captiva.htm

ChartSearch Data Search Technology
http://www.ChartSearch.net/

Client-Side Deep Web Data Extraction
http://www.tic.udc.es/~mad/publications/ceceast2004.pdf

Connotate – Web Data Extraction and Monitoring
http://www.connotate.com/

ContextMiner – Tools to Collect Data, Metadata and Contextual Information
http://www.contextminer.org/

cQuery – Content Query Engine
http://cquery.com/

Create a Crawler – Extract Data From an Entire Website
http://support.import.io/knowledgebase/articles/247570-create-a-crawler

cURL groks URLs – Command Line Tool for Transferring Data
http://curl.haxx.se/

Data Extraction Services
http://www.dataextractionservices.com/

Data Mining Resources
http://www.DataMiningResources.info/

Dataminr – Real-time Information Discovery
http://www.dataminr.com/

DataSift – Powerful Social Data Platform
http://datasift.com/

DataWrangler – Data Cleaning and Transformation Tool
http://vis.stanford.edu/wrangler/

Deep Web Research
http://www.DeepWebResearch.info/

DiffBot – Get Data From Web Pages Automatically
http://www.DiffBot.com/

Digital Footprints – Collect Facebook Data
http://digitalfootprints.dk/

DiscoverText – Import, Sort, Distribute and Analyze Electronic Content from eMail, Document Repositories, and Social Media
http://discovertext.com/

Easy PDF Cloud
https://www.easypdfcloud.com/

eGrabber – Data Capture Tools
http://www.egrabber.com/

ExtractData Technologies – SearchExtract Software
http://www.extradata.com/

Facepager – Fetching Public Data From Facebook
https://github.com/strohne/Facepager

FeedsAPI – Extract Content from Web Pages Tool
http://www.feedsapi.com/

Ficstar Software – Web Data Extraction
http://www.ficstar.com/

File Information Tool Set (FITS)
http://projects.iq.harvard.edu/fits

Huginn – Your Agents Are Standing By
https://github.com/cantino/huginn

Imagination Engines
http://www.Imagination-Engines.com/

Import.io – Turn the Web Into Data With Extractors, Crawlers and Connectors
https://import.io/

InfoExtractor – Extracts Relevant Information from Blogs, YouTube and Twitter
http://www.infoextractor.org/

Information Retrieval (IR) and Information Extraction (IE) on the Web
http://www.webir.org/

Introduction to Information Retrieval
http://www-nlp.stanford.edu/IR-book/

iOpus Internet Macros
http://www.iopus.com/imacros/

iRobotSoft – Visual Web Scraping and Web Automation
http://irobotsoft.com/

iWeb Scraping Services
http://www.iwebscraping.com/

jSEO – Web Crawler For Search Engine Optimization
http://codecanyon.net/item/jseo-web-crawler-for-search-engine-optimization/8770392

Junar – Discovering Data
http://www.junar.com/

Karma – Data Integration Tool
https://usc-isi-i2.github.io/karma/

Kimono – Turn Websites Into Structured APIs From Your Browser In Seconds
https://www.kimonolabs.com/

Knowledge Discovery Resources
http://www.KnowledgeDiscovery.info/

Knowlesys® – Web Data Extraction, Web Grabber and Screen Scraper
http://www.knowlesys.com/index.htm

LingPipe – Information Extraction and Data Mining Tools
http://alias-i.com/lingpipe/

LoginWorks – Advanced Solutions – Data Mining and Web Scraping
http://www.loginworks.com/

Metadata Extraction Tool
http://meta-extractor.sourceforge.net/

Mozenda – Comprehensive Web Data Gathering
http://www.mozenda.com/

NCapture – Capture Web Content
http://www.qsrinternational.com/products_nvivo_add-ons.aspx

Netlytic – Making Sense of Online Conversations
https://netlytic.org/home/

Newprosoft – Web Data Extraction Software
http://newprosoft.com/

NewsClipper.com – Snip and Ship Dynamic News Content to Your Web Pages
http://www.newsclipper.com/

OutWit Hub – Harvest the Web With Your Own Web Collection Engine
http://www.outwit.com/

Pervasive Data Management and Integration Products
http://www.pervasive.com/

Priceonomics – Crawl Data From the Web
http://priceonomics.com/

QL2 Software – Unstructured Data Management and Web Mining Software
http://www.ql2.com/

REBOL Technologies
http://www.rebol.com/

Semantic Scholar – Free Scientific Literature Search and Discovery
http://allenai.org/semantic-scholar.html

ScissorsFly – Your Web Clipper and Scrapbook
https://alternativeto.net/software/scissorsfly/

ScrapeForge
http://freecode.com/projects/scrapeforge

Scraper
http://freecode.com/projects/scraper

ScraperWiki – Community of Programmers Sifting Information To Give You the Edge
https://scraperwiki.com/

ScrapeShield – Monitor and Track Misuse of Your Content
https://www.cloudflare.com/apps/scrapeshield

Scrapy – Open Source Web Scraping Framework for Python
http://scrapy.org/

Screen-Scraper
http://freecode.com/projects/screenscraper

Screen-Scraper – Extracts Information From Web Sites
http://www.Screen-Scraper.com/

Screenscraping the Senate by Paul Ford
http://www.xml.com/pub/a/2004/09/01/hack-congress.html

Search and Replace with TextPipe Pattern Matching
http://www.datamystic.com/textpipe.html

Social Media Data Collection Tools
http://socialmediadata.wikidot.com/

Spinn3r – Indexing the Blogosphere
http://www.spinn3r.com/

Squirro – Find, Remember, Organize and Share Important Information
http://squirro.com/

STACKS – Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse
https://github.com/bitslabsyr/stack

Texifter – Search, Sift, Sort, Classify and Analyze
http://texifter.com/

TextRazor – Text Analysis Infrastructure
https://www.textrazor.com/

Topicgrazer – Graze On Web Pages and Documents
http://www.topicscape.com/Topicgrazer/help.php

Unit Miner – Web Data Extraction Software
http://www.unitminer.com/

W3C Publishes Data Extraction Language (DEL) as W3C Note
http://xml.coverpages.org/ni2001-11-06-a.html

Web Data Extraction Software
https://www.automationanywhere.com/webdataext

Web Data Extractor
http://www.rafasoft.com/

Web-Harvest – Open Source Web Data Extraction Tool
http://web-harvest.sourceforge.net/index.php

Website Extractor – Offline Browser
http://www.internet-soft.com/extractor.htm

WebSunDew – Advanced Web Scraping Tool
http://www.websundew.com/

Wikimedia Public Data Dumps
http://meta.wikimedia.org/wiki/Data_dumps

XRay Web Scraping Tool
http://freecode.com/projects/xrayguibasedwebscrapingtool

YaCy Web Page Indexer
http://freecode.com/projects/yacy
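
Several of the tools listed above, such as Beautiful Soup and Scrapy, are built around the same basic step: parsing HTML and pulling out structured fields. As a rough illustration of that step only, here is a minimal sketch that uses Python's standard library html.parser module to collect link text and URLs from a page. The class name LinkExtractor, the helper extract_links, and the sample HTML are hypothetical names made up for this example; they are not taken from any of the listed tools, and a real extractor would also have to fetch pages, respect robots.txt, and cope with malformed markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (text, href) pairs for every <a href="..."> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (link text, URL) tuples found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page; a real extractor would fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a href="http://scrapy.org/">Scrapy</a></li>
  <li><a href="http://curl.haxx.se/">cURL</a></li>
</ul>
"""

if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(f"{text}\t{href}")
```

The standard library parser is used here only to keep the sketch free of dependencies; the dedicated libraries in the list are generally more forgiving of broken markup.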
