Nutch in Nutshell

8/8/2019 Nutch in Nutshell

1/20

Nutch in a Nutshell

Presented by Liew Guo Min

Zhao Jin


2/20

Outline

  Recap

  Special features

  Running Nutch in a distributed environment (with demo)

  Q&A

  Discussion


3/20

Recap

  Complete web search engine

    Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)

  Java based, open source

  Features:

    Customizable

    Extensible

    Distributed


4/20

Nutch as a crawler

[Diagram: crawl workflow. Components: Initial URLs, Injector, CrawlDB, CrawlDBTool, Generator, Segment, Fetcher, Parser, Web, Webpages/files. Arrow labels: generate, get, update, read/write.]
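
The components named on this slide can be read as a loop: inject the initial URLs, generate a segment, fetch, parse, then update the CrawlDB. The sketch below is a toy, in-memory model of that reading, not Nutch code. The `WEB` dictionary, the status strings, and every function name are hypothetical and chosen only for illustration.

```python
import re

# A pretend "Web" (hypothetical data): URL -> page content containing links.
WEB = {
    "http://a.example/": 'see <a href="http://b.example/">b</a>',
    "http://b.example/": 'see <a href="http://a.example/">a</a> '
                         'and <a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}


def inject(crawl_db, seed_urls):
    """Injector: add the initial URLs to the CrawlDB as unfetched."""
    for url in seed_urls:
        crawl_db.setdefault(url, "unfetched")


def generate(crawl_db):
    """Generator: create a new segment listing the URLs still to be fetched."""
    fetch_list = sorted(u for u, status in crawl_db.items() if status == "unfetched")
    return {"fetch_list": fetch_list, "content": {}, "outlinks": {}}


def fetch(segment):
    """Fetcher: get each page from the Web and write it into the segment."""
    for url in segment["fetch_list"]:
        segment["content"][url] = WEB.get(url, "")


def parse(segment):
    """Parser: read fetched content and write the extracted links back."""
    for url, content in segment["content"].items():
        segment["outlinks"][url] = re.findall(r'href="([^"]+)"', content)


def update_db(crawl_db, segment):
    """CrawlDB update: mark fetched URLs and add newly discovered ones."""
    for url in segment["fetch_list"]:
        crawl_db[url] = "fetched"
    for links in segment["outlinks"].values():
        for link in links:
            crawl_db.setdefault(link, "unfetched")


def crawl(seed_urls, depth):
    """Run `depth` rounds of generate -> fetch -> parse -> update."""
    crawl_db = {}
    inject(crawl_db, seed_urls)
    for _ in range(depth):
        segment = generate(crawl_db)
        fetch(segment)
        parse(segment)
        update_db(crawl_db, segment)
    return crawl_db


result = crawl(["http://a.example/"], depth=2)
```

After two rounds the toy CrawlDB marks the first two pages as fetched and the page discovered in round two as still unfetched, which is why a deeper crawl needs more rounds.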


5/20

Special Features

  Extensible (Plugin system)

    Most of the essential functionalities of Nutch are implemented as plugins

  Three layers

    Extension points

      What can be extended: Protocol, Parser, ScoringFilter, etc.

    Extensions

      The interfaces to be implemented for the extension points

    Plugins

      The actual implementation
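
The three layers can be illustrated with a small language-neutral sketch. This is not the Nutch plugin API (which is Java). The `Parser` interface, the two parser classes, and `PluginRegistry` are all hypothetical names, used only to show how an interface, its implementations, and a lookup by extension point fit together.

```python
import re
from abc import ABC, abstractmethod


class Parser(ABC):
    """Interface to implement in order to extend the 'parser' extension point."""

    @abstractmethod
    def can_parse(self, content_type: str) -> bool:
        ...

    @abstractmethod
    def parse(self, content: str) -> str:
        ...


class PlainTextParser(Parser):
    """Plugin: an actual implementation for plain text."""

    def can_parse(self, content_type):
        return content_type == "text/plain"

    def parse(self, content):
        return content.strip()


class HtmlParser(Parser):
    """Plugin: an actual implementation that strips HTML tags."""

    def can_parse(self, content_type):
        return content_type == "text/html"

    def parse(self, content):
        return re.sub(r"<[^>]+>", "", content).strip()


class PluginRegistry:
    """Maps an extension point name to the plugins registered for it."""

    def __init__(self):
        self._plugins = {}

    def register(self, extension_point, plugin):
        self._plugins.setdefault(extension_point, []).append(plugin)

    def find_parser(self, content_type):
        for plugin in self._plugins.get("parser", []):
            if plugin.can_parse(content_type):
                return plugin
        raise LookupError(f"no parser plugin for {content_type}")


registry = PluginRegistry()
registry.register("parser", PlainTextParser())
registry.register("parser", HtmlParser())
text = registry.find_parser("text/html").parse("<b>hi</b> there")
```

The caller only knows the extension point and the interface. Which plugin does the work is decided at run time, which is what lets new formats be added without touching the core.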


6/20


Anyone can write a plugin

  Write the code

  Prepare metadata files

    plugin.xml: what has been extended by what

    build.xml: how ant can build your source code

  Ask Nutch to include your plugin in conf/nutch-site.xml

  Tell ant to build your plugin in src/plugin/build.xml

  More details @ http://wiki.apache.org/nutch/PluginCentral
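
As a rough illustration of the nutch-site.xml step, plugins are typically enabled through the `plugin.includes` property, whose value is a regular expression over plugin directory names. The fragment below is a sketch under assumptions: `my-plugin` is a hypothetical plugin id, and the other ids are only a sample, so check the default list shipped with your Nutch version before copying it.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- "my-plugin" is a hypothetical id; the other entries are a sample only. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|my-plugin</value>
  </property>
</configuration>
```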


7/20


To use a plugin

  Make sure you have modified nutch-site.xml to include the plugin

  Then, either

    Nutch calls it automatically when needed, or

    You can write code that loads it by its class name and then uses it


8/20

Special Features

  Distributed (Hadoop)

  Map-Reduce (Diagram)

    A framework for distributed programming

    Map -- Process the splits of data to get intermediate results and the keys to indicate what should be put together later

    Reduce -- Process the intermediate results with the same key and output the final result
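
The Map and Reduce roles described above can be sketched in a few lines. This is a single-process illustration, not Hadoop code. The function `map_reduce`, the `parse_readings` mapper, and the temperature data are hypothetical. A real framework would run the map calls on different machines and move the intermediate results over the network.

```python
from collections import defaultdict


def map_reduce(splits, mapper, reducer):
    """Apply mapper to each split, group values by key, reduce each group."""
    # Map: each split yields intermediate (key, value) pairs.
    intermediate = []
    for split in splits:
        intermediate.extend(mapper(split))

    # Group: the key decides which values are put together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, over all values sharing that key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def parse_readings(split):
    """Mapper: turn 'city,temperature' lines into (city, temperature) pairs."""
    pairs = []
    for line in split:
        city, temperature = line.split(",")
        pairs.append((city, int(temperature)))
    return pairs


def hottest(city, temperatures):
    """Reducer: keep the maximum reading for one city."""
    return max(temperatures)


splits = [["Oslo,3", "Rome,18"], ["Oslo,7", "Rome,15"]]
result = map_reduce(splits, parse_readings, hottest)
```

Here the city name is the key, so readings for the same city end up in the same reduce call even though they came from different splits.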


9/20


MapReduce in Nutch

  Example 1: Parsing

    Input: files from fetch

    Map(url, content) by calling parser plugins

    Reduce is identity

  Example 2: Dumping a segment

    Input: files from segment

    Map is identity

    Reduce(url, value*) by simply concatenating the text representation of values
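
Both examples can be mimicked with a tiny job runner: in the first the map does the work and the reduce passes values through, in the second it is the other way round. The sketch below is an illustration only. `run_job`, `strip_tags` (a stand-in for a parser plugin), and the sample records are hypothetical and do not reflect Nutch's actual classes or file formats.

```python
import re
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Map each (key, value) record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def identity_map(key, value):
    return [(key, value)]


def identity_reduce(key, values):
    return [(key, value) for value in values]


def strip_tags(content):
    """Hypothetical stand-in for a parser plugin."""
    return re.sub(r"<[^>]+>", "", content).strip()


def parse_map(url, content):
    """Example 1: the map calls the parser."""
    return [(url, strip_tags(content))]


def dump_reduce(url, values):
    """Example 2: the reduce concatenates the text form of all values."""
    return [(url, "\n".join(str(value) for value in values))]


fetched = [("http://a.example/", "<p>Hello</p>"),
           ("http://b.example/", "<h1>Nutch</h1>")]
parsed = run_job(fetched, parse_map, identity_reduce)

segment = [("http://a.example/", "status: fetched"),
           ("http://a.example/", "text: Hello")]
dumped = run_job(segment, identity_map, dump_reduce)
```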


10/20


Distributed File System

  Write-once-read-many coherence model

    High throughput

  Master/slave

    Simple architecture

    Single point of failure

  Transparent

    Access via Java API

  More info @ http://lucene.apache.org/hadoop/hdfs_design.html


11/20

Running Nutch in a distributed environment

  MapReduce

    In hadoop-site.xml

      Specify job tracker host & port

        mapred.job.tracker

      Specify task numbers

        mapred.map.tasks

        mapred.reduce.tasks

      Specify location for temporary files

        mapred.local.dir
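
A hadoop-site.xml fragment setting these four properties might look like the sketch below. Only the property names come from the slide. The host name, port, task counts, and directory are placeholder values assumed for illustration and must be replaced with values that suit your cluster.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- All values are placeholders (assumptions), not recommendations. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```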


12/20

Running Nutch in a distributed environment

  DFS

    In hadoop-site.xml

      Specify namenode host, port & directory

        fs.default.name

        dfs.name.dir

      Specify location for files on each datanode

        dfs.data.dir
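
The DFS settings follow the same property/value layout. In the sketch below only the property names come from the slide. The host, port, and paths are assumed placeholders, and the exact form expected for `fs.default.name` depends on the Hadoop version (later releases expect a URI such as `hdfs://host:port`).

```xml
<?xml version="1.0"?>
<configuration>
  <!-- All values are placeholders (assumptions), not recommendations. -->
  <property>
    <name>fs.default.name</name>
    <value>master.example.com:9000</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/data</value>
  </property>
</configuration>
```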


13/20

Demo time!


14/20

Q&A


15/20

Discussion


16/20

Exercises

  Hands-on exercises

    Install Nutch, crawl a few webpages using the crawl command and perform a search on it using the GUI

    Repeat the crawling process without using the crawl command

    Modify your configuration to perform each of the following crawl jobs and think about when they would be useful:

      To crawl only webpages and PDFs but not anything else

      To crawl the files on your hard disk

      To crawl but not to parse

    (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state


17/20

Reference

  http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins

  http://lucene.apache.org/hadoop/ -- Hadoop homepage

  http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki

  http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"

  http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"

  http://www.mail-archive.com/nutch-[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together


18/20

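Excursion: MapReduce

  Problem

    Find the number of occurrences of cat in a file

    What if the file is 20GB large?

    Why not do it with more computers?

  Solution

    [Diagram: the File is divided into Split 1 and Split 2. PC1 counts 200 in Split 1, PC2 counts 300 in Split 2, and PC1 adds the two counts to get 500.]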
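
The idea on this slide is to count within each split independently and then add up the partial counts. A minimal sketch follows, run sequentially in one process for simplicity. The helper names and the sample lines are hypothetical, and splitting is done on line boundaries so that no word is cut in half.

```python
def split_lines(lines, n_splits):
    """Divide the lines into at most n_splits contiguous parts."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_word(lines, word):
    """Work done by one machine: count the word within its own split."""
    return sum(line.split().count(word) for line in lines)


lines = ["the cat sat", "a dog and a cat", "cat cat", "no pets here"]
splits = split_lines(lines, 2)

# One partial count per "PC"; on a cluster these would run in parallel.
partial_counts = [count_word(split, "cat") for split in splits]

# A single machine then adds the partial counts.
total = sum(partial_counts)
```

Because each partial count depends only on its own split, adding machines shortens the counting step, and the final addition stays cheap.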


19/20

Excursion: MapReduce

  Problem

    Find the number of occurrences of both cat and dog in a very large file

  Solution

    [Diagram: the File is divided into Split 1 and Split 2 and flows through the stages below.]

    Input files: Split 1, Split 2

    Map: PC1 (Split 1) emits cat: 200, dog: 250; PC2 (Split 2) emits cat: 300, dog: 250

    Intermediate files, after Sort/Group: cat: 200, 300; dog: 250, 250

    Reduce: PC1 outputs cat: 500; PC2 outputs dog: 500

    Output files
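
The numbers on this slide can be traced stage by stage. The sketch below starts from the map outputs shown for PC1 and PC2, groups them by key, and hands each key to one reducer. Assigning keys to reducers round-robin in sorted order is an assumption made here so that the result matches the slide. Real frameworks usually partition keys by hash.

```python
from collections import defaultdict

# Map output per split, using the counts shown on the slide.
map_outputs = [
    [("cat", 200), ("dog", 250)],  # PC1, Split 1
    [("cat", 300), ("dog", 250)],  # PC2, Split 2
]

# Sort/Group: collect all values that share a key.
grouped = defaultdict(list)
for pairs in map_outputs:
    for key, value in pairs:
        grouped[key].append(value)

# Partition (assumed round-robin): decide which reducer handles which key.
reducers = ["PC1", "PC2"]
assignment = {key: reducers[i % len(reducers)]
              for i, key in enumerate(sorted(grouped))}

# Reduce: each reducer sums the values of the keys assigned to it.
output = {name: {} for name in reducers}
for key, values in grouped.items():
    output[assignment[key]][key] = sum(values)
```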


20/20

Excursion: MapReduce

  Generalized Framework

    [Diagram: a Master coordinates the Workers through the stages below.]

    Input files: Split 1, Split 2, Split 3, Split 4

    Map: three Workers emit the intermediate pairs k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6

    Intermediate files, after Sort/Group: k1:v1,v3; k2:v4,v5; k3:v2; k4:v6

    Reduce: three Workers

    Output files: Output 1, Output 2, Output 3
