Nutch Search Engine Tool
Nutch overview
A full-fledged web search engine
Functionalities of Nutch
  Internet and intranet crawling
  Parsing different document formats (PDF, HTML, XML, JS, DOC, PPT, etc.)
  Web interface for querying the index
  Management of recrawls
Nutch Architecture
Four main components:
  Crawler
  Web Database (WebDB, LinkDB, segments)
  Indexer
  Searcher
Crawler and Searcher are highly decoupled, enabling independent scaling
Highly modular, plugin-based architecture
Nutch Architecture (cont.)
Source: Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York
Steps in a Crawl+Index cycle
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
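The loop in steps 1-6 can be sketched as a small in-memory simulation. This is an illustration only, not Nutch code: the toy web graph, the `example.test` URLs, and the `crawl` function are hypothetical, and steps 7-10 (updatesegs, index, dedup, merge) are not modelled.

```python
# Hypothetical toy web: URL -> outlinks found on that page (not real data).
TOY_WEB = {
    "http://example.test/A": ["http://example.test/B"],
    "http://example.test/B": ["http://example.test/A", "http://example.test/C"],
    "http://example.test/C": [],
}


def crawl(root_urls, depth, web=TOY_WEB):
    """Run up to `depth` generate/fetch/update rounds over a toy web graph."""
    webdb = set(root_urls)   # steps 1-2: create the DB and inject the root URLs
    fetched = set()
    segments = []            # one segment per generate/fetch/update round
    for _ in range(depth):   # step 6: repeat steps 3-5
        fetchlist = sorted(webdb - fetched)   # step 3: generate a fetchlist
        if not fetchlist:
            break
        # step 4: "fetch" each URL; here that just looks up its outlinks
        segment = {url: web.get(url, []) for url in fetchlist}
        fetched.update(fetchlist)
        for outlinks in segment.values():     # step 5: update the DB with links
            webdb.update(outlinks)
        segments.append(segment)
    return webdb, segments


webdb, segments = crawl(["http://example.test/A"], depth=3)
print(len(webdb), "known URLs in", len(segments), "segments")
```

Each round only fetches URLs discovered in earlier rounds, which is why the depth setting bounds how far the crawl reaches from the root URLs.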
Crawling (cont.)
Can effectively crawl up to ~100M pages
Crawl statistics on the KReSIT site (it.iitb):
  Took 153 minutes for a deep crawl (depth = 10)
  Crawled 4171 documents
  Size of crawl on disk: 168 MB
  Size of index: ~25 MB
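For a rough sense of scale, the reported figures work out to about 27 documents per minute, about 41 KB on disk per document, and an index of roughly 15% of the crawl size. The arithmetic below uses only the numbers quoted above and assumes 1 MB = 1024 KB; the variable names are illustrative.

```python
# Figures as reported for the it.iitb crawl.
minutes = 153
documents = 4171
crawl_mb = 168
index_mb = 25   # reported as approximate

docs_per_minute = documents / minutes
kb_per_document = crawl_mb * 1024 / documents   # assumes 1 MB = 1024 KB
index_fraction = index_mb / crawl_mb

print(f"{docs_per_minute:.1f} docs/min")
print(f"{kb_per_document:.1f} KB per document")
print(f"index is {index_fraction:.0%} of the crawl size")
```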
Web Database (WebDB)
Persistent data structure for mirroring the structure and properties of the web graph being crawled
The WebDB stores two types of entities: pages and links
Optimised for frequent updates
Crawl Structure of it.iitb
Page DB
Page Database, used for fetch scheduling
Contains:
  pages indexed and sorted by MD5 and by URL
  outlinks, fetch information, page score
A set of APIs is provided to perform the various operations
Sample data of PageDB
Page 1:
  Version: 4
  URL: http://keaton/tinysite/A.html
  ID: fb8b9f0792e449cda72a9670b4ce833a
  Next fetch: Thu Nov 24 11:13:35 GMT 2005
  Retries since fetch: 0
  Retry interval: 30 days
  Num outlinks: 1
  Score: 1.0
  NextScore: 1.0
Page 2:
  Version: 4
  URL: http://keaton/tinysite/B.html
  ID: 404db2bd139307b0e1b696d3a1a772b4
  Next fetch: Thu Nov 24 11:13:37 GMT 2005
  Retries since fetch: 0
  Retry interval: 30 days
  Num outlinks: 3
  Score: 1.0
  NextScore: 1.0
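The sample records above can be modelled as a small record type to show how the two sort orders and fetch scheduling might work. This is a sketch under assumptions, not the Nutch API: the `Page` class, its field names, and `due_for_fetch` are hypothetical, and the ID is simply copied from the sample as an opaque MD5 string.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for a PageDB entry, mirroring the sample fields."""
    url: str
    md5: str                  # the "ID" field in the sample dump
    next_fetch: datetime
    retry_interval: timedelta
    num_outlinks: int
    score: float = 1.0


pages = [
    Page("http://keaton/tinysite/B.html", "404db2bd139307b0e1b696d3a1a772b4",
         datetime(2005, 11, 24, 11, 13, 37, tzinfo=timezone.utc),
         timedelta(days=30), 3),
    Page("http://keaton/tinysite/A.html", "fb8b9f0792e449cda72a9670b4ce833a",
         datetime(2005, 11, 24, 11, 13, 35, tzinfo=timezone.utc),
         timedelta(days=30), 1),
]

# The same pages kept in two orders, one per lookup key.
by_url = sorted(pages, key=lambda p: p.url)
by_md5 = sorted(pages, key=lambda p: p.md5)


def due_for_fetch(pages, now):
    """Return the pages whose next-fetch time has passed (fetch scheduling)."""
    return [p for p in pages if p.next_fetch <= now]


now = datetime(2005, 11, 24, 11, 13, 36, tzinfo=timezone.utc)
print([p.url for p in due_for_fetch(pages, now)])
```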
Link DB
Link Database
Contains:
  links sorted by MD5
  links sorted by URL
Represents the full link graph
Stores the anchor text associated with each link
Used for:
  link analysis
  anchor text indexing
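The two uses listed above can be illustrated with a toy link table. This is a minimal sketch, not the LinkDB implementation: the `Link` type, the `example.test` URLs, and the anchor strings are hypothetical, and links are keyed by plain URL here rather than by MD5.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """Hypothetical link record: source page, target page, anchor text."""
    from_url: str
    to_url: str
    anchor: str


links = [
    Link("http://example.test/A", "http://example.test/B", "about B"),
    Link("http://example.test/C", "http://example.test/B", "B home"),
    Link("http://example.test/B", "http://example.test/A", "back to A"),
]


def anchors_by_target(links):
    """Group anchor text by target URL, as input to anchor text indexing."""
    index = defaultdict(list)
    for link in links:
        index[link.to_url].append(link.anchor)
    return dict(index)


def inlink_counts(links):
    """Count incoming links per URL, a very simple form of link analysis."""
    return Counter(link.to_url for link in links)


print(anchors_by_target(links)["http://example.test/B"])
print(inlink_counts(links).most_common(1))
```

Indexing a page under the anchor text of links that point to it lets a query match the page even when the words do not appear in the page itself.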
Segments
Collection of pages fetched and indexed by the crawler in a single run
One segment dir for each crawl-fetch-update cycle at a particular depth
Contains raw text and parsed data of the files crawled
Used to return the cached copy of a page and to generate snippets on the results page
Segments (cont.)
The segread tool gives a useful summary of all segments (Parsed, Started, Finished, Dir)
It can also be used to dump the segment data in raw text format
The dump switch gives the following details:
  Fetcher Output: <url, hash, fetch-date ..>; the entries that go into the WebDB
  Content: raw content, including HTTP headers and other metadata; stored as the cached copy of a page
  ParseData & ParseText: generated by the appropriate parser plugin, which is chosen by looking at the raw content
Nutch API
Plugins
Provide extensions to extension points
Each extension point defines an interface that must be implemented by the extension
Some core extension points:
  IndexingFilter: add metadata to indexed fields
  Parser: parse a new type of document
  NutchAnalyzer: language-specific analyzers
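The extension-point idea can be sketched as an abstract interface plus one implementation. Nutch plugins are written in Java against Nutch's own interfaces; the Python below only illustrates the pattern. Apart from the borrowed name `IndexingFilter`, everything here (`filter`, `HostFieldFilter`, `run_filters`, the document dictionary) is hypothetical and does not reflect the real Nutch signatures.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class IndexingFilter(ABC):
    """Stand-in for an extension point: an interface that extensions implement."""

    @abstractmethod
    def filter(self, doc: dict, url: str) -> dict:
        """Return the document with any extra fields added."""


class HostFieldFilter(IndexingFilter):
    """Hypothetical extension: adds the host name as a metadata field."""

    def filter(self, doc: dict, url: str) -> dict:
        enriched = dict(doc)                       # leave the input untouched
        enriched["host"] = urlparse(url).hostname
        return enriched


def run_filters(filters, doc, url):
    """Apply every registered extension in turn, as an indexer might."""
    for extension in filters:
        doc = extension.filter(doc, url)
    return doc


doc = run_filters([HostFieldFilter()], {"title": "Page A"},
                  "http://example.test/A.html")
print(doc)
```

The indexer only depends on the interface, so new filters can be added without changing the code that calls them.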
References
Nutch Docs: http://lucene.apache.org/nutch/
Nutch Wiki: http://wiki.apache.org/nutch/
Prasad Pingali, CLIA consortium, Nutch Workshop, 2007
Tom White, Introduction to Nutch, java.net website (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html)