
Large Scale Crawling with Apache Nutch and Friends


DESCRIPTION

Presented by Julien Nioche, Director, DigitalPebble This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.


Page 1: Large Scale Crawling with Apache Nutch and Friends

Large Scale Crawling with Apache Nutch and Friends

Julien Nioche – [email protected]

LUCENE/SOLR REVOLUTION EU 2013

Page 2: Large Scale Crawling with Apache Nutch and Friends

About myself

DigitalPebble Ltd, Bristol (UK)

Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

Strong focus on Open Source & the Apache ecosystem

VP of Apache Nutch

User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Page 3: Large Scale Crawling with Apache Nutch and Friends

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Page 4: Large Scale Crawling with Apache Nutch and Friends


Nutch?

“Distributed framework for large scale web crawling”(but does not have to be large scale at all)

Based on Apache Hadoop

Apache TLP since May 2010

Indexing and Search by … [logo on slide]

Page 5: Large Scale Crawling with Apache Nutch and Friends


A bit of history

2002/2003 : Started by Doug Cutting & Mike Cafarella

2005 : MapReduce implementation in Nutch

– 2006 : Hadoop sub-project of Lucene @Apache

2006/7 : Parser and MimeType in Tika

– 2008 : Tika sub-project of Lucene @Apache

May 2010 : TLP project at Apache

Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

Page 6: Large Scale Crawling with Apache Nutch and Friends

Recent Releases

[Timeline diagram of releases from 06/09 to 06/13 – trunk (1.x): 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7; 2.x branch: 2.0, 2.1, 2.2.1]

Page 7: Large Scale Crawling with Apache Nutch and Friends

Why use Nutch?

Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Usual reasons
– Open source with a business-friendly license, mature, community, ...

Scalability
– Tried and tested on very large scale
– Standard Hadoop

Page 8: Large Scale Crawling with Apache Nutch and Friends

Use cases

Crawl for search
– Generic or vertical
– Index and search with SOLR et al.
– Single node to large clusters on the Cloud

… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– Machine Learning, with MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

Page 9: Large Scale Crawling with Apache Nutch and Friends

Customer cases

[Chart: customer cases positioned by Specificity (Verticality) vs. Size]

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

Page 10: Large Scale Crawling with Apache Nutch and Friends

CommonCrawl

http://commoncrawl.org/

Open repository of web crawl data
– 2012 dataset : 3.83 billion docs
– ARC files on Amazon S3

Using Nutch 1.7
– A few modifications to Nutch code: https://github.com/Aloisius/nutch

Next release imminent

Page 11: Large Scale Crawling with Apache Nutch and Friends

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Page 12: Large Scale Crawling with Apache Nutch and Friends

Installation

http://nutch.apache.org/downloads.html

1.7 => src and bin distributions
2.2.1 => src only

'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

Binary distribution for 1.x == runtime/local
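For orientation, building from a source release looks roughly like this (version number and archive path are illustrative):

# download and unpack a source release (path/version illustrative)
wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz
tar xzf apache-nutch-1.7-src.tar.gz
cd apache-nutch-1.7
# build both runtime flavours
ant clean runtime
# runtime/local => local mode, runtime/deploy => Hadoop job jar + scripts
ls runtime/local runtime/deploy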

Page 13: Large Scale Crawling with Apache Nutch and Friends

Configuration and resources

Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

Specify configuration in nutch-site.xml
– Leave nutch-default alone!

At least :

<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>

Page 14: Large Scale Crawling with Apache Nutch and Friends

Running it!

bin/crawl script : typical sequence of steps

bin/nutch : individual Nutch commands
– inject / generate / fetch / parse / update …

Local mode : great for testing and debugging

Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters
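For example, a small test crawl in local mode might be launched as below. The seed directory, crawl directory and SOLR URL are placeholders, and the argument order follows the 1.7-era script; check the script's usage message for your release:

# seed list: one URL per line
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
# run two rounds of crawling and index into SOLR
bin/crawl urls testcrawl http://localhost:8983/solr/ 2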

Page 15: Large Scale Crawling with Apache Nutch and Friends


Monitor Crawl with MapReduce UI

Page 16: Large Scale Crawling with Apache Nutch and Friends


Counters and logs

Page 17: Large Scale Crawling with Apache Nutch and Friends

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Page 18: Large Scale Crawling with Apache Nutch and Friends


Typical Nutch Steps

1) Inject → populates CrawlDB from seed list

2) Generate → Selects URLs to fetch into a segment

3) Fetch → Fetches URLs from segment

4) Parse → Parses content (text + metadata)

5) UpdateDB → Updates CrawlDB (new URLs, new status...)

6) InvertLinks → Builds the webgraph (LinkDB)

7) Index → Send docs to [SOLR | ES | CloudSearch | … ]

Sequence of batch operations

Or use the all-in-one crawl script

Repeat steps 2 to 7

Same in 1.x and 2.x
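Spelled out as individual 1.x commands in local mode (directory names and the SOLR URL are illustrative; this mirrors the sequence above):

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`   # most recent segment
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s1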

Page 19: Large Scale Crawling with Apache Nutch and Friends

Main steps from a data perspective

[Diagram: Seed List → CrawlDB → Segments; InvertLinks builds the LinkDB]

Each segment contains:
– crawl_generate/
– crawl_fetch/
– content/
– crawl_parse/
– parse_data/
– parse_text/

Page 20: Large Scale Crawling with Apache Nutch and Friends

Frontier expansion

Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction

[Diagram: crawl frontier growing out from the seed over iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]

Page 21: Large Scale Crawling with Apache Nutch and Friends

An extensible framework

Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)

Plugins
– Activated with the parameter 'plugin.includes' (see example below)
– Implement one or more endpoints
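For instance, in nutch-site.xml – the value is a regular expression over plugin ids; the one below mirrors the default shipped in nutch-default.xml around 1.7:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>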

Page 22: Large Scale Crawling with Apache Nutch and Friends

Features

Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (see the config sketch after this list)

Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
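The politeness / parallelism trade-off is controlled from nutch-site.xml; a sketch with illustrative values (property names as in nutch-default.xml):

<!-- group fetch queues by host, domain or ip -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>
<!-- seconds between two requests to the same queue -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
<!-- total number of fetcher threads -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<!-- cap on URLs per host/domain selected in each round -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>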

Page 23: Large Scale Crawling with Apache Nutch and Friends

Features (cont.)

Protocols
– http, file, ftp, https
– Respects robots.txt directives

Scheduling
– Fixed or adaptive

URL filters
– Regex, FSA, TLD, prefix, suffix (see the regex example below)

URL normalisers
– Default, regex
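As an example of the regex variant, conf/regex-urlfilter.txt is a list of +/- patterns applied top-down, first match wins; a minimal file limiting the crawl to one (placeholder) site:

# skip common binary/image suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|gz|pdf)$
# skip URLs containing characters often found in session ids / queries
-[?*!@=]
# accept only our target site (placeholder domain)
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.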

Page 24: Large Scale Crawling with Apache Nutch and Friends

Features (cont.)

Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

Pluggable indexing
– SOLR | ES etc...

Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

Page 25: Large Scale Crawling with Apache Nutch and Friends

Indexing

Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud: https://issues.apache.org/jira/browse/NUTCH-1377

ElasticSearch
– Version 0.90.1

AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

Easy to build your own
– Text, DB, etc...

Page 26: Large Scale Crawling with Apache Nutch and Friends

Typical Nutch document

Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata

Page 27: Large Scale Crawling with Apache Nutch and Friends

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Page 28: Large Scale Crawling with Apache Nutch and Friends

NUTCH 2.x

2.0 released in July 2012

2.2.1 in July 2013

Common features with 1.x
– MapReduce, Tika, delegation to SOLR, etc...

Moved to a 'big table'-like architecture
– Wealth of NoSQL projects in the last few years

Abstraction over the storage layer → Apache GORA

Page 29: Large Scale Crawling with Apache Nutch and Friends

Apache GORA

http://gora.apache.org/

ORM for NoSQL databases
– and limited SQL support + file-based storage

Serialization with Apache AVRO

Object-to-datastore mappings (backend-specific)

Current version 0.3

DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)

Page 30: Large Scale Crawling with Apache Nutch and Friends

AVRO Schema => Java code

{
  "name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "baseUrl", "type": ["null", "string"]},
    {"name": "status", "type": "int"},
    {"name": "fetchTime", "type": "long"},
    {"name": "prevFetchTime", "type": "long"},
    {"name": "fetchInterval", "type": "int"},
    {"name": "retriesSinceFetch", "type": "int"},
    {"name": "modifiedTime", "type": "long"},
    {"name": "protocolStatus", "type": {
      "name": "ProtocolStatus",
      "type": "record",
      "namespace": "org.apache.nutch.storage",
      "fields": [
        {"name": "code", "type": "int"},
        {"name": "args", "type": {"type": "array", "items": "string"}},
        {"name": "lastModified", "type": "long"}
      ]
    }},
    […]

Page 31: Large Scale Crawling with Apache Nutch and Friends

Mapping file (backend-specific – HBase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

Page 32: Large Scale Crawling with Apache Nutch and Friends

DataStore operations

Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
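A minimal sketch of the basic operations against the Nutch WebPage class (factory signature per the Gora 0.3-era API; treat the details as assumptions):

import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;

public class GoraSketch {
  public static void main(String[] args) throws Exception {
    // backend picked via conf/gora.properties
    // (e.g. gora.datastore.default=org.apache.gora.hbase.store.HBaseStore)
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, new Configuration());
    // Nutch 2.x keys are URLs with the host reversed, e.g. "org.apache.nutch:http/"
    String key = "org.apache.nutch:http/";
    WebPage page = store.get(key);     // basic get
    if (page != null) {
      page.setFetchInterval(3600);     // modify a field
      store.put(key, page);            // write it back
    }
    store.flush();
    store.close();
  }
}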

Page 33: Large Scale Crawling with Apache Nutch and Friends

GORA in Nutch

AVRO schema provided and Java code pre-generated

Mapping files provided for backends
– can be modified if necessary

Need to rebuild to get the dependencies for a backend
– hence source-only distribution of Nutch 2.x

http://wiki.apache.org/nutch/Nutch2Tutorial
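Concretely, per the Nutch2Tutorial, selecting the HBase backend amounts to enabling the gora-hbase dependency in ivy/ivy.xml, rebuilding, and pointing Nutch at the store (a sketch):

<!-- nutch-site.xml -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>

# conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore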

Page 34: Large Scale Crawling with Apache Nutch and Friends


Benefits

Storage still distributed and replicated

… but one big table

– status, metadata, content, text → one place

– no more segments

Resume-able fetch and parse steps

Easier interaction with other resources

– Third-party code just needs to use GORA and the schema

Simplify the Nutch code

Potentially faster (e.g. update step)

Page 35: Large Scale Crawling with Apache Nutch and Friends

Drawbacks

More stuff to install and configure
– Higher hardware requirements

Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2 + HBase : 2.7x slower than 1.x
– N2 + Cassandra : 4.4x slower than 1.x
– Due mostly to the GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!

Not as stable as Nutch 1.x

Page 36: Large Scale Crawling with Apache Nutch and Friends

2.x Work in progress

Stabilise backend implementations
– GORA-HBase most reliable

Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSoC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

Filter-enabled scans (GORA-119)
– => no need to de-serialize the whole dataset

Page 37: Large Scale Crawling with Apache Nutch and Friends

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Page 38: Large Scale Crawling with Apache Nutch and Friends

Future

New functionalities
– Support for SOLRCloud
– Sitemaps (from the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)

1.x and 2.x to coexist in parallel
– 2.x not yet a replacement for 1.x

Move to the new MapReduce API
– Use Nutch on Hadoop 2.x

Page 39: Large Scale Crawling with Apache Nutch and Friends

More delegation

Great deal done in recent years (SOLR, Tika)

Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – see the sketch below
– Fetcher / protocol handling
– URL normalisation / filtering
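crawler-commons already covers things like robots.txt parsing; a small sketch of its API (class and method names from the library, example URLs hypothetical):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsSketch {
  public static void main(String[] args) {
    byte[] robotsTxt = "User-agent: *\nDisallow: /private/".getBytes();
    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    // parse the rules as they apply to our agent name
    BaseRobotRules rules = parser.parseContent(
        "http://example.com/robots.txt", robotsTxt, "text/plain", "mycrawler");
    System.out.println(rules.isAllowed("http://example.com/private/x.html")); // false
    System.out.println(rules.isAllowed("http://example.com/index.html"));     // true
  }
}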

PageRank-like computations → graph library
– Apache Giraph
– Should be more efficient + less code to maintain

Page 40: Large Scale Crawling with Apache Nutch and Friends

Longer term

Hadoop 2.x & YARN

Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

End of 100% batch operations?
– Fetch and parse as streaming?
– Always be fetching
– Generate / update / pagerank remain batch

See https://github.com/DigitalPebble/storm-crawler

Page 41: Large Scale Crawling with Apache Nutch and Friends

Where to find out more?

Project page : http://nutch.apache.org/

Wiki : http://wiki.apache.org/nutch/

Mailing lists :
– [email protected]
– [email protected]

Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

Support / consulting :
– http://wiki.apache.org/nutch/Support

Page 42: Large Scale Crawling with Apache Nutch and Friends

Questions?

Page 43: Large Scale Crawling with Apache Nutch and Friends
