8/8/2019 Nutch in Nutshell
Nutch in a Nutshell
Presented by Liew Guo Min and Zhao Jin
Outline
  Recap
  Special features
  Running Nutch in a distributed environment (with demo)
  Q&A
  Discussion
Recap
  Complete web search engine
    Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
  Java based, open source
  Features:
    Customizable
    Extensible
    Distributed
Nutch as a crawler
  [Diagram. Components: Initial URLs, Injector, CrawlDB, CrawlDBTool, Generator, Segment, Fetcher, Parser, Web (webpages/files). Edge labels: generate, get, update, read/write.]
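The crawl cycle in the diagram above can be imitated with a toy, in-memory sketch. This is not Nutch code: FAKE_WEB, the status strings, and every function name below are hypothetical stand-ins, chosen only to mirror the roles of the Injector, Generator, Fetcher, Parser and the CrawlDB update step.

```python
# Toy imitation of the inject / generate / fetch / parse / update cycle.
# Everything here is hypothetical and lives in memory; real Nutch stores
# the CrawlDB and segments on disk (or on a distributed file system).

FAKE_WEB = {  # hypothetical "web": url -> outlinks found on that page
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Injector: add the initial URLs to the CrawlDB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb):
    """Generator: list the URLs that are due for fetching."""
    return sorted(url for url, status in crawldb.items() if status == "unfetched")


def fetch(fetch_list):
    """Fetcher: download each URL; here the 'content' is just its outlinks."""
    return {url: FAKE_WEB.get(url, []) for url in fetch_list}


def parse(segment):
    """Parser: extract the outlinks from everything in the segment."""
    outlinks = set()
    for links in segment.values():
        outlinks.update(links)
    return outlinks


def update(crawldb, segment, outlinks):
    """CrawlDB update: mark fetched pages and add newly discovered URLs."""
    for url in segment:
        crawldb[url] = "fetched"
    for url in outlinks:
        crawldb.setdefault(url, "unfetched")


def crawl(seeds, depth):
    """Run the cycle `depth` times, starting from the seed URLs."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(depth):
        fetch_list = generate(crawldb)
        if not fetch_list:
            break
        segment = fetch(fetch_list)
        update(crawldb, segment, parse(segment))
    return crawldb
```

With the hypothetical FAKE_WEB above, one round fetches only the seed and records its two outlinks as unfetched; a second round fetches those as well.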
Special Features
  Extensible (Plugin system)
    Most of the essential functionalities of Nutch are implemented as plugins
    Three layers
      Extension points
        What can be extended: Protocol, Parser, ScoringFilter, etc.
      Extensions
        The interfaces to be implemented for the extension points
      Plugins
        The actual implementation
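As a loose analogy only, the layering can be sketched in Python. Nutch itself is written in Java and wires plugins together through its own metadata files, so the names below (Parser, UpperCaseParser, PluginRegistry) are hypothetical and do not reflect Nutch's API. The sketch just shows the idea of an interface that can be extended, an implementation of it, and a registry that knows which implementation extends what.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Hypothetical stand-in for something that can be extended."""

    @abstractmethod
    def parse(self, url: str, content: str) -> str:
        """Turn raw content into parsed text."""


class UpperCaseParser(Parser):
    """Hypothetical plugin code: an actual implementation of the interface."""

    def parse(self, url: str, content: str) -> str:
        return content.upper()


class PluginRegistry:
    """Records 'what has been extended by what' (assumed design, not Nutch's)."""

    def __init__(self):
        self._implementations = {}

    def register(self, point, implementation):
        # Refuse implementations that do not honour the interface.
        if not issubclass(implementation, point):
            raise TypeError(f"{implementation.__name__} does not implement {point.__name__}")
        self._implementations.setdefault(point, []).append(implementation)

    def extensions_for(self, point):
        """Instantiate every registered implementation of `point`."""
        return [cls() for cls in self._implementations.get(point, [])]


registry = PluginRegistry()
registry.register(Parser, UpperCaseParser)
parsers = registry.extensions_for(Parser)
```

A caller that only knows the Parser interface can then iterate over `parsers` without naming any concrete class.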
Special Features
  Extensible (Plugin system)
    Anyone can write a plugin
      Write the code
      Prepare metadata files
        plugin.xml: what has been extended by what
        build.xml: how ant can build your source code
      Ask Nutch to include your plugin in conf/nutch-site.xml
      Tell ant to build your plugin in src/plugin/build.xml
    More details @ http://wiki.apache.org/nutch/PluginCentral
Special Features
  Extensible (Plugin system)
    To use a plugin
      Make sure you have modified nutch-site.xml to include the plugin
      Then, either
        Nutch will automatically call it when needed, or
        You can write code that loads it by its class name and then uses it
Special Features
  Distributed (Hadoop)
    MapReduce (Diagram)
      A framework for distributed programming
      Map -- Process the splits of data to get intermediate results and the keys to indicate what should be put together later
      Reduce -- Process the intermediate results with the same key and output the final result
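The Map and Reduce roles described above can be sketched as a single-process Python function. This is a minimal illustration of the idea, not Hadoop's API: map_reduce and the longest-word example are hypothetical names, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_reduce(splits, map_fn, reduce_fn):
    """Run map over every split, group values by key, then reduce per key."""
    groups = defaultdict(list)
    for split in splits:
        # Map: each split yields (key, intermediate value) pairs.
        for key, value in map_fn(split):
            groups[key].append(value)
    # Reduce: all intermediate values sharing a key are processed together.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def first_letter_map(split):
    """Hypothetical map: key each word by its first letter."""
    for word in split.split():
        yield word[0].lower(), word


def longest_word_reduce(key, values):
    """Hypothetical reduce: keep the longest word seen for this key."""
    return max(values, key=len)


result = map_reduce(
    ["apple ant bee", "banana avocado"],
    first_letter_map,
    longest_word_reduce,
)
print(result)  # {'a': 'avocado', 'b': 'banana'}
```

The keys chosen in the map step decide what "should be put together later": here, every word starting with the same letter ends up in the same reduce call regardless of which split it came from.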
Special Features
  Distributed (Hadoop)
    MapReduce in Nutch
      Example 1: Parsing
        Input: files from fetch
        Map(url, content) by calling parser plugins
        Reduce is identity
      Example 2: Dumping a segment
        Input: , etc. files from segment
        Map is identity
        Reduce(url, value*) by simply concatenating the text representation of values
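The two examples above differ only in which side does the work. A small Python sketch, assuming a hypothetical run_job helper and a deliberately naive toy_parser in place of real parser plugins, shows a job whose reduce is the identity next to one whose map is the identity. None of these names come from Nutch.

```python
import re
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Hypothetical single-process job runner over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def identity_map(key, value):
    yield key, value


def identity_reduce(key, values):
    for value in values:
        yield key, value


def toy_parser(content):
    """Naive stand-in for a parser plugin: strip anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", content).strip()


def parse_map(url, content):
    """Example 1: the map does the parsing; the reduce is the identity."""
    yield url, toy_parser(content)


def dump_reduce(url, values):
    """Example 2: the map is the identity; the reduce concatenates text forms."""
    yield url, "\n".join(str(value) for value in values)


fetched = [("http://a.example/", "<p>hello</p>")]
parsed = run_job(fetched, parse_map, identity_reduce)

segment = [("http://a.example/", "status=fetched"), ("http://a.example/", "hello")]
dumped = run_job(segment, identity_map, dump_reduce)
```

In the dump job, records for the same URL arrive from different files of the segment and are brought together only by sharing the URL as key.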
Special Features
  Distributed (Hadoop)
    Distributed file system
      Write-once-read-many coherence model
        High throughput
      Master/slave
        Simple architecture
        Single point of failure
      Transparent
        Access via Java API
      More info @ http://lucene.apache.org/hadoop/hdfs_design.html
Running Nutch in a distributed environment
  MapReduce
    In hadoop-site.xml
      Specify job tracker host & port
        mapred.job.tracker
      Specify task numbers
        mapred.map.tasks
        mapred.reduce.tasks
      Specify location for temporary files
        mapred.local.dir
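For illustration, the properties listed above could be written into the name/value property layout that Hadoop site files use with a short Python script. The host name, port, task counts and directory below are made-up placeholder values, not recommendations, and build_site_xml is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict as <configuration><property><name/><value/></property>...</configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; adjust to the actual cluster.
MAPRED_SETTINGS = {
    "mapred.job.tracker": "master.example:9001",     # job tracker host & port
    "mapred.map.tasks": 4,                           # number of map tasks
    "mapred.reduce.tasks": 2,                        # number of reduce tasks
    "mapred.local.dir": "/tmp/hadoop/mapred/local",  # temporary files
}

mapred_xml = build_site_xml(MAPRED_SETTINGS)
print(mapred_xml)
```

Generating the file from a dict is only a convenience for the sketch; editing hadoop-site.xml by hand gives the same result.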
Running Nutch in a distributed environment
  DFS
    In hadoop-site.xml
      Specify namenode host, port & directory
        fs.default.name
        dfs.name.dir
      Specify location for files on each datanode
        dfs.data.dir
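Before starting a cluster it can help to confirm that the DFS properties named above are actually set. The following Python sketch reads a site file's text and reports which of the three keys are missing; the sample XML, its values, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three DFS-related keys named on the slide.
REQUIRED_DFS_KEYS = ("fs.default.name", "dfs.name.dir", "dfs.data.dir")


def read_properties(xml_text):
    """Parse <property><name>/<value> pairs from a site file into a dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def missing_keys(properties, required=REQUIRED_DFS_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not properties.get(key)]


# Hypothetical file content with one key deliberately left out.
SAMPLE_SITE_XML = """<configuration>
  <property><name>fs.default.name</name><value>master.example:9000</value></property>
  <property><name>dfs.name.dir</name><value>/data/hadoop/name</value></property>
</configuration>"""

print(missing_keys(read_properties(SAMPLE_SITE_XML)))  # ['dfs.data.dir']
```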
Demo time!
Q&A
Discussion
Exercises
  Hands-on exercises
    Install Nutch, crawl a few webpages using the crawl command, and search the result using the GUI
    Repeat the crawling process without using the crawl command
    Modify your configuration to perform each of the following crawl jobs and think about when they would be useful
      To crawl only webpages and PDFs but not anything else
      To crawl the files on your hard disk
      To crawl but not to parse
    (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
Reference
  http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
  http://lucene.apache.org/hadoop/ -- Hadoop homepage
  http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
  http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf -- "MapReduce in Nutch"
  http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
  http://www.mail-archive.com/nutch-[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
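Excursion: MapReduce
  Problem
    Find the number of occurrences of cat in a file
    What if the file is 20GB large?
    Why not do it with more computers?
  Solution
    [Diagram. The file is divided into Split 1 and Split 2. PC1 and PC2 each count one split, giving 200 and 300. PC1 then adds the partial counts: 500.]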
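The split-then-add idea can be checked on a tiny input. This Python sketch uses a six-word string in place of the 20GB file, and count_word and split_in_two are hypothetical helpers; on real data each split would be counted on a different machine.

```python
def count_word(text, word):
    """Count whole-token occurrences of `word` in `text`."""
    return sum(1 for token in text.split() if token == word)


def split_in_two(text):
    """Cut the text into two halves on a word boundary."""
    words = text.split()
    middle = len(words) // 2
    return " ".join(words[:middle]), " ".join(words[middle:])


text = "cat dog cat bird cat cat"  # toy stand-in for the large file
split_1, split_2 = split_in_two(text)

# Each "PC" counts its own split independently...
partial_counts = [count_word(split_1, "cat"), count_word(split_2, "cat")]
# ...and one of them adds the partial counts together.
total = sum(partial_counts)

print(partial_counts, total)  # [2, 2] 4
```

Splitting on word boundaries matters: cutting at an arbitrary byte offset could slice a word in half and lose an occurrence.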
Excursion: MapReduce
  Problem
    Find the number of occurrences of both cat and dog in a very large file
  Solution
    [Diagram. Stages: Input files, Map, Intermediate files, Sort/Group, Reduce, Output files.]
      Input files: the file is divided into Split 1 and Split 2
      Map: PC1 emits cat: 200, dog: 250; PC2 emits cat: 300, dog: 250
      Sort/Group: cat: 200, 300; dog: 250, 250
      Reduce: PC1 outputs cat: 500; PC2 outputs dog: 500
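The Sort/Group and Reduce stages can be replayed in Python using the intermediate counts from the slide. The function names are hypothetical, and map_split is included only to show how such per-split counts might be produced from text.

```python
from collections import Counter, defaultdict


def map_split(text, wanted):
    """Map: count each wanted word within one split, emitting (word, count) pairs."""
    counts = Counter(token for token in text.split() if token in wanted)
    return list(counts.items())


def sort_group(mapped_outputs):
    """Sort/Group: collect the counts from all map outputs under their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Reduce: sum the grouped partial counts for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Intermediate results as shown on the slide (PC1 and PC2 map outputs).
slide_map_output = [
    [("cat", 200), ("dog", 250)],
    [("cat", 300), ("dog", 250)],
]

grouped = sort_group(slide_map_output)
totals = reduce_counts(grouped)
print(grouped)  # {'cat': [200, 300], 'dog': [250, 250]}
print(totals)   # {'cat': 500, 'dog': 500}
```

Because each key is reduced independently, the cat and dog totals can be computed on different machines, as the slide's PC1 and PC2 do.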
Excursion: MapReduce
  Generalized Framework
    [Diagram. A Master coordinates the Workers. Stages: Input files, Map, Intermediate files, Sort/Group, Reduce, Output files.]
      Input files: Split 1, Split 2, Split 3, Split 4
      Map (three Workers) emits: k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6
      Sort/Group: k1:v1,v3; k2:v4,v5; k3:v2; k4:v6
      Reduce (three Workers) writes: Output 1, Output 2, Output 3
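A master handing splits to a pool of workers can be approximated in one process with Python's concurrent.futures. This is a sketch under stated assumptions: run_master is a hypothetical name, threads stand in for separate machines, and the assignment of key/value pairs to the four splits is invented so that the map output matches the pairs on the slide.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_master(splits, map_fn, reduce_fn, workers=3):
    """Hand splits to map workers, group by key, then hand keys to reduce workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per split; pool.map keeps results in input order.
        mapped = list(pool.map(map_fn, splits))

        # Sort/Group phase, done by the master in this sketch.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        keys = sorted(groups)

        # Reduce phase: one task per key.
        reduced = list(pool.map(lambda key: reduce_fn(key, groups[key]), keys))
    return dict(zip(keys, reduced))


# Hypothetical contents of Split 1..4, already in (key, value) form.
splits = [
    [("k1", "v1"), ("k3", "v2")],
    [("k1", "v3")],
    [("k2", "v4")],
    [("k2", "v5"), ("k4", "v6")],
]

outputs = run_master(
    splits,
    map_fn=lambda split: split,                      # identity map
    reduce_fn=lambda key, values: ",".join(values),  # concatenate grouped values
)
print(outputs)  # {'k1': 'v1,v3', 'k2': 'v4,v5', 'k3': 'v2', 'k4': 'v6'}
```

There are more splits and keys than workers, so the pool reuses workers for several tasks, which is the same reason the diagram shows four splits feeding three map workers.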