Large Scale Crawling with Apache Nutch
Julien Nioche, DigitalPebble Ltd
ApacheCon Europe 2012
I'll be talking about large scale web crawling and more specifically about Apache Nutch, which is an open source project based on Hadoop.
About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
Web Crawling
Natural Language Processing
Information Retrieval
Data Mining
Strong focus on Open Source & Apache ecosystem
Apache Nutch VP
Apache Tika committer
User | Contributor : SOLR, Lucene
GATE, UIMA
Mahout
Behemoth
A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from... What makes the identity of DigitalPebble is... The main projects I am involved in are...
Objectives
Overview of the project
Nutch in a nutshell
Nutch 2.x
Future developments
Nutch?
Distributed framework for large scale web crawling
but does not have to be large scale at all
or even on the web (file-protocol)
Based on Apache Hadoop
Indexing and Search
Apache TLP since May 2010
Note that I mention crawling and not web search: Nutch is used not only for search. It used to do indexing and search with Lucene, but now delegates this to SOLR.
Short history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2004 : sub-project of Lucene @Apache
2005 : MapReduce implementation in Nutch
2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
June 2012 : Nutch 1.5.1
Oct 2012 : Nutch 2.1
Major Releases
2.1 : October 2012
2.0 : July 2012
1.5 : June 2012
1.4 : November 2011
1.3 : June 2011
1.2 : September 2010
1.1 : June 2010
1.0 : March 2009
0.9 : April 2007
0.8 : July 2006
Recent Releases
[Timeline chart: Nutch releases from 1.0 (06/09) through 1.5.1 on the 1.x branch and 2.0 / 2.1 on the 2.x branch, up to trunk (06/12)]
Mailing lists
http://pulse.apache.org/#nutch.apache.org
user@nutch.apache.org
Current subscribers: 984
Current digest subscribers: 15
Total posts (607 days): 5390
Mean posts per day: 8.88
dev@nutch.apache.org
Current subscribers: 487
Current digest subscribers: 5
Total posts (607 days): 6099
Mean posts per day: 10.05
Community
6 active committers / PMC members
4 within the last 18 months
Constant stream of new contributions & bug reports
Steady numbers of mailing list subscribers and traffic
Nutch is a very healthy 10-year old
Why use Nutch?
Features
e.g. index with SOLR
PageRank implementation
Can be extended with plugins
Usual reasons
Mature, business-friendly license, community, ...
Scalability
Tried and tested on very large scale
Not the best option when ...
Hadoop based == batch processing == high latency
No guarantee that a page will be fetched / parsed / indexed within X minutes|hours
Requires a Hadoop cluster: installation and skills
Javascript / Ajax not supported (yet)
Use cases
Crawl for IR
Generic or vertical
Index and Search with SOLR
Single node to large clusters on Cloud
but also
Data Mining
NLP (e.g. Sentiment Analysis)
ML
MAHOUT / UIMA / GATE
Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
Customer cases
Specificity (Verticality)
Scale
Use case : BetterJobs.com
Single server
Aggregates content from job portals
Extracts and normalizes structure (description, requirements, locations)
~1M pages total
Feeds SOLR index
Use case : SimilarPages.com
Large cluster on Amazon EC2 (up to 400 nodes)
Fetched & parsed 3 billion pages
10+ billion pages in crawlDB (~100TB data)
200+ million lists of similarities
No indexing / search involved
Typical Nutch Steps
Inject : populates CrawlDB from seed list
Generate : selects URLs to fetch in a segment
Fetch : fetches URLs from the segment
Parse : parses content (text + metadata)
UpdateDB : updates CrawlDB (new URLs, new status...)
InvertLinks : builds the WebGraph
SOLRIndex : sends docs to SOLR
SOLRDedup : removes duplicate docs based on signature
Sequence of batch operations
Or use the all-in-one crawl script
Repeat steps 2 to 8
Same in 1.x and 2.x
Main steps in Nutch; more actions are available. Shell wrappers around Hadoop commands; see the command sketch below.
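For illustration, here is roughly what one round looks like with the shell wrappers (a minimal sketch for Nutch 1.x; the crawl/ layout and the SOLR URL are placeholder values, and the exact solrindex arguments vary slightly between releases):

# 1. Inject: populate the CrawlDB from a seed list in urls/
bin/nutch inject crawl/crawldb urls
# 2. Generate a fetch list, then pick up the new segment
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=`ls -d crawl/segments/2* | tail -1`
# 3-4. Fetch, then parse
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
# 5. Update the CrawlDB with new URLs and statuses
bin/nutch updatedb crawl/crawldb $SEGMENT
# 6. Build the web graph
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# 7-8. Index to SOLR, then remove duplicates
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $SEGMENT
bin/nutch solrdedup http://localhost:8983/solr/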
Main steps
[Diagram: Seed List → CrawlDB → Segment (/crawl_generate/, /crawl_fetch/, /content/, /crawl_parse/, /parse_data/, /parse_text/) → LinkDB]
Frontier expansion
Manual discovery
Adding new URLs by hand, seeding
Automatic discovery of new resources (frontier expansion)
Not all outlinks are equally useful - control
Requires content parsing and link extraction
[Diagram: frontier expansion from the seed through iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
An extensible framework
Endpoints
Protocol
Parser
HtmlParseFilter (ParseFilter in Nutch 2.x)
ScoringFilter (used in various places)
URLFilter (ditto)
URLNormalizer (ditto)
IndexingFilter
Plugins
Activated with parameter 'plugin.includes'
Implement one or more endpoints
Endpoints are called in various places: URL filters and normalisers in a lot of places, same for scoring filters. A minimal URLFilter example follows.
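To make the endpoint idea concrete, here is a minimal sketch of a URLFilter plugin (the class name and the rule are invented for the example; a real plugin also ships a plugin.xml descriptor and must be listed in 'plugin.includes'):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical filter: drop any URL carrying a query string
public class NoQueryURLFilter implements URLFilter {
  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // Returning null tells Nutch to discard the URL
    return urlString.contains("?") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}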
Features
Fetcher
Multi-threaded fetcher
Follows robots.txt
Groups URLs per hostname / domain / IP
Limits the number of URLs per round of fetching
Default values are polite but can be made more aggressive
Crawl strategy
Breadth-first but can be depth-first
Configurable via custom scoring plugins
Scoring
OPIC (On-line Page Importance Calculation) by default
LinkRank
Fetcher: multithreaded but polite
Features (cont.)
Protocols : http, file, ftp, https
Scheduling : specified or adaptive
URL filters : regex, FSA, TLD, prefix, suffix (see the example below)
URL normalisers : default, regex
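As an example, the regex filter reads its rules from conf/regex-urlfilter.txt; the first matching pattern wins, '+' keeps a URL and '-' rejects it (illustrative rules, not the shipped defaults):

# skip common binary extensions
-\.(gif|jpg|png|zip|gz)$
# skip URLs containing characters often used in session IDs / queries
-[?*!@=]
# keep everything under apache.org, reject the rest
+^https?://([a-z0-9-]+\.)*apache\.org/
-.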
Features (cont.)
Other plugins
CreativeCommons
Feeds
Language Identification
Rel tags
Arbitrary Metadata
Indexing to SOLR
Bespoke schema
Parsing with Apache Tika
Hundreds of formats supported
But some legacy parsers as well
Data Structures in 1.x
MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
CrawlDB => status of known pages
CrawlDB : MapFile
byte status;            // [fetched? unfetched? failed? redir?]
long fetchTime;
byte retries;
int fetchInterval;
float score = 1.0f;
byte[] signature = null;
long modifiedTime;
org.apache.hadoop.io.MapWritable metaData;
Input of : generate - index
Output of : inject - update
Writable object : CrawlDatum
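The CrawlDB can be inspected from the shell, e.g. (crawl/crawldb and the URL are placeholders):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -url http://example.com/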
Data Structures 1.x
Segment :
/crawl_generate/   SequenceFile
/crawl_fetch/      MapFile
/content/          MapFile
/crawl_parse/      SequenceFile
/parse_data/       MapFile
/parse_text/       MapFile
Segment => round of fetching
Identified by a timestamp
Can have multiple versions of a page in different segments
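Segments can be listed and dumped to plain text for inspection, e.g.:

bin/nutch readseg -list -dir crawl/segments
bin/nutch readseg -dump crawl/segments/<timestamp> segment_dump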
Data Structures 1.x
LinkDB : MapFile
Inlinks : HashSet<Inlink>
Inlink : String fromUrl, String anchor
Output of : invertlinks
Input of : SOLRIndex
LinkDB => storage for the Web Graph
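The LinkDB has a similar reader, e.g.:

bin/nutch readlinkdb crawl/linkdb -url http://example.com/
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump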
NUTCH 2.x
2.0 released in July 2012
2.1 in October 2012
Same features as in 1.x: delegation to SOLR, Tika, MapReduce, etc.
Moved to table-based architecture
Wealth of NoSQL projects in the last few years
Abstraction over the storage layer: Apache GORA
Apache GORA
http://gora.apache.org/
ORM for NoSQL databases
plus limited SQL support + file-based storage
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
DataStore implementations
0.2.1 released in August 2012
Accumulo
Cassandra
HBase
Avro
DynamoDB (soon)
SQL
AVRO Schema => Java code
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }},[]
Mapping file (backend-specific, e.g. HBase)
DataStore operations
Atomic operations
get(K key)
put(K key, T obj)
delete(K key)
Querying
execute(Query query) : Result
deleteByQuery(Query query)
Wrappers for Apache Hadoop
GoraInput|OutputFormat
GoraRecordReader|Writer
GoraMapper|Reducer
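A minimal sketch of these operations from Java, following the Nutch 2.x convention of keying WebPage records by reversed URL (backend selection and error handling omitted; exact generated setters depend on the AVRO schema):

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The concrete backend (HBase, Cassandra...) is picked via gora.properties
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    String key = "com.example.www:http/";   // reversed URL key
    WebPage page = store.get(key);          // atomic get
    if (page != null) {
      page.setFetchTime(System.currentTimeMillis());
      store.put(key, page);                 // atomic put
    }
    store.flush();
    store.close();
  }
}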
GORA in Nutch
AVRO schema provided and Java code pre-generated
Mapping files provided for backendscan be modified if necessary
Need to rebuild to get dependencies for the backend
No binary distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
What does this mean for Nutch?
Benefits
Storage still distributed and replicated
but one big table
status, metadata, content, text in one place
Simplified logic in Nutch
Simpler code for updating / merging information
More efficient (?)
No need to read / write the entire structure to update records
No comparison available yet + early days for GORA
Easier interaction with other resources
Third-party code just needs to use GORA and the schema
What does this mean for Nutch?
Drawbacks
More stuff to install and configure :-)
Not as stable as Nutch 1.x
Dependent on success of Gora
2.x Work in progress
Stabilise backend implementations
GORA-HBase most reliable
Synchronize features with 1.x
e.g. has ElasticSearch but missing a LinkRank equivalent
Filter-enabled scans (GORA-119)
No need to de-serialize the whole dataset
Future
New functionalities
Support for SOLRCloud
Sitemap (from Crawler Commons library)
Canonical tag
More indexers (e.g. ElasticSearch) + pluggable indexers?
Both 1.x and 2.x in parallel
but more frequent releases for 2.x
More delegation
Great deal done in recent years (SOLR, Tika)
Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
Fetcher / protocol handling
Robots.txt parsing
URL normalisation / filtering
Move PageRank-like computations to a graph library, e.g. Apache Giraph
Should be more efficient as well
Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists : user@nutch.apache.org / dev@nutch.apache.org
Chapter in 'Hadoop: The Definitive Guide' (T. White)
Understanding Hadoop is essential anyway...
Support / consulting : http://wiki.apache.org/nutch/Support
Questions?