Intro to Web Search Michael J. Cafarella December 5, 2007


Page 1: Intro to Web Search Michael J. Cafarella December 5, 2007

Intro to Web Search

Michael J. Cafarella

December 5, 2007

Page 2: Intro to Web Search Michael J. Cafarella December 5, 2007

Searching the Web
- Web search is basically a database problem, but no one uses SQL databases
  - Every query is a top-k query
  - Every query plan is the same
  - Massive numbers of queries and data
  - Read-only data
- A search query can be thought of in SQL terms, but the engineered system is completely different

Page 3: Intro to Web Search Michael J. Cafarella December 5, 2007

Nutch & Hadoop: A case study
- Open-source, free to use and change
- Nutch is a search engine; it can handle ~200M pages
- Hadoop is backend infrastructure; its biggest deployment is a 2000-machine cluster
- There have been many different search engine designs, but Nutch is pretty standard and easy to learn from

Page 4: Intro to Web Search Michael J. Cafarella December 5, 2007

Outline
- Search basics
  - What are the elementary steps?
- Nutch design
  - Link database, fetcher, indexer, etc.
- Hadoop support
  - Distributed filesystem, job control

Page 5: Intro to Web Search Michael J. Cafarella December 5, 2007

Search document model
- Think of a “web document” as a tuple with several columns:
  - Incoming link text
  - Title
  - Page content (maybe many sub-parts)
  - Unique docid
- A web search is really:

    SELECT * FROM docs
    WHERE docs.text LIKE 'userquery' AND docs.title LIKE 'userquery' AND …
    ORDER BY relevance

- Where relevance is very complicated…

Page 6: Intro to Web Search Michael J. Cafarella December 5, 2007

Search document model (2)
- Three main challenges to processing a query:
  - Processing speed
  - Result relevance
  - Scaling to many documents

Page 7: Intro to Web Search Michael J. Cafarella December 5, 2007

Processing speed
- You could grep, but then each query would need to touch every document
- The key to fast processing is the inverted index
- Basic idea: for each word, list all the documents where that word can be found
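To make the inverted index concrete, here is a minimal in-memory sketch. The function name `build_inverted_index`, the whitespace tokenizer, and the three sample documents are assumptions made for illustration; a real engine such as Lucene stores postings on disk in a far more compact form.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to a sorted list of the docids that contain it."""
    postings = defaultdict(set)
    for docid, text in docs.items():
        for word in text.lower().split():
            postings[word].add(docid)
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical three-document corpus.
docs = {
    0: "seattle mayors give nickels",
    1: "friendly cities such as seattle",
    2: "words such as billy",
}
index = build_inverted_index(docs)
print(index["seattle"])  # [0, 1]
print(index["such"])     # [1, 2]
```

Looking up a word is now a single dictionary access instead of a scan over every document.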

Page 8: Intro to Web Search Michael J. Cafarella December 5, 2007

[Figure: inverted index. Each term in the dictionary (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) points to a posting list of the form: #docs, docid0, docid1, docid2, …, docid#docs-1]

Page 9: Intro to Web Search Michael J. Cafarella December 5, 2007

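[Figure: answering a query with the inverted index. The dictionary terms are the same as on the previous slide (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words), and each posting list has the form #docs, docid0, docid1, docid2, …, docid#docs-1.]

Query: such as

The two posting lists shown for the query terms (same #docs, docid0, docid1, … layout):
  104 21 150 322 2501
  15 99 322 426 1309

1. Test for equality
2. Advance smaller pointer
3. Abort when a list is exhausted

Returned docs: 322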
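The three steps on this slide can be sketched as a two-pointer merge. This is a minimal version that assumes both posting lists are sorted by ascending docid and that the leading 104 and 15 on the slide are the #docs counts; only the docids visible on the slide are used, and the name `intersect` is illustrative.

```python
def intersect(a, b):
    """Intersect two posting lists sorted by ascending docid."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):   # step 3: stop when either list is exhausted
        if a[i] == b[j]:               # step 1: test for equality
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:              # step 2: advance the smaller pointer
            i += 1
        else:
            j += 1
    return out

list_a = [21, 150, 322, 2501]
list_b = [99, 322, 426, 1309]
print(intersect(list_a, list_b))  # [322]
```

Each pointer only moves forward, so the work is linear in the combined length of the two lists.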

Page 10: Intro to Web Search Michael J. Cafarella December 5, 2007

Result Relevance
- Modern search engines use hundreds of different clues for ranking
  - Page title, meta tags, bold text, text position on page, word frequency, …
- A few are usually considered standard
  - tfidf(t, d) = freq(t-in-d) / freq(t-in-corpus)
  - Link analysis: link counting, PageRank, etc.
  - Incoming anchor text
- Big gains are now hard to find
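As a small illustration, the ratio on this slide can be computed directly. The sketch below implements the slide's simplified formula as written; commonly used tf-idf variants scale the corpus term with a logarithm of inverse document frequency instead. The function name `slide_tfidf` and the sample documents are assumptions.

```python
from collections import Counter

def slide_tfidf(term, doc_tokens, corpus_counts):
    """freq(t-in-d) / freq(t-in-corpus), as written on the slide."""
    in_corpus = corpus_counts[term]
    if in_corpus == 0:
        return 0.0  # term never appears anywhere
    return doc_tokens.count(term) / in_corpus

# Hypothetical two-document corpus, already tokenized.
docs = [
    "seattle mayors give nickels".split(),
    "friendly cities such as seattle such as seattle".split(),
]
corpus_counts = Counter(tok for d in docs for tok in d)

# 2 of the 3 occurrences of "seattle" are in the second document.
print(slide_tfidf("seattle", docs[1], corpus_counts))
```

A term that appears only in one document scores 1.0 there, while a term spread across the whole corpus scores low in any single document.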

Page 11: Intro to Web Search Michael J. Cafarella December 5, 2007

Scaling to many documents
- Not even the inverted index can handle billions of docs on a single machine
- Need to parallelize query
  - Segment by document
  - Segment by search term

Page 12: Intro to Web Search Michael J. Cafarella December 5, 2007

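Scaling (2): doc segmenting

[Figure: the query “britney” is sent to each of five index segments (Docs 0-1M, Docs 1-2M, Docs 2-3M, Docs 3-4M, Docs 4-5M). The segments return Ds 1, 29; Ds 1.2M, 1.7M; Ds 2.3M, 2.9M; Ds 3.1M, 3.2M; and Ds 4.4M, 4.5M respectively, which are merged into the final result list 1.2M, 4.4M, 29, …]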

Page 13: Intro to Web Search Michael J. Cafarella December 5, 2007

Scaling (3)
- Segment by document, pros/cons:
  - Easy to partition (just MOD the docid)
  - Easy to add new documents
  - If a machine fails, quality goes down but queries don’t die
- Segment by term, pros/cons:
  - Harder to partition (terms uneven)
  - Trickier to add a new document (need to touch many machines)
  - If a machine fails, a search term might disappear, but not critical pages (e.g., yahoo.com/index.html)
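Segmenting by document can be sketched as a scatter-gather: every segment answers the query over its own documents and a front end merges the partial results. In the sketch below the segment count, the scores, and the names `segment_for` and `search` are all hypothetical; it partitions with MOD on the docid rather than by docid range.

```python
import heapq

NUM_SEGMENTS = 5  # hypothetical cluster size

def segment_for(docid):
    """Partition by document: just MOD the docid."""
    return docid % NUM_SEGMENTS

def search(segments, term, k, down=frozenset()):
    """Send the term to every live segment and merge the local top-k lists."""
    partial = []
    for seg_id, index in enumerate(segments):
        if seg_id in down:
            continue  # a dead segment costs quality, but the query still answers
        hits = index.get(term, [])  # list of (score, docid) pairs
        partial.extend(heapq.nlargest(k, hits))
    return heapq.nlargest(k, partial)

# Build per-segment indexes from made-up (docid, score) pairs.
segments = [dict() for _ in range(NUM_SEGMENTS)]
for docid, score in [(1, 0.2), (29, 0.7), (1200003, 0.9), (4400002, 0.8)]:
    segments[segment_for(docid)].setdefault("britney", []).append((score, docid))

print(search(segments, "britney", k=3))
print(search(segments, "britney", k=3, down={3}))  # segment 3 has failed
```

With one segment down the query still returns results, only without the documents that lived on the failed machine, which matches the "quality goes down but queries don't die" trade-off.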

Page 14: Intro to Web Search Michael J. Cafarella December 5, 2007

Intro to Nutch
- A search engine is more than just the query system
- Simply obtaining the pages and constructing the index is a lot of work

Page 15: Intro to Web Search Michael J. Cafarella December 5, 2007

[Figure: Nutch architecture. Components shown: Inject; WebDB; Fetchlist 0..2 of N; Fetcher 0..2 of N; Content (of N); Update 0..2 of N; Indexer 0..2 of N; Index 0..2 of N; Searcher 0..2 of N; WebServer 0..2 of M.]

Page 16: Intro to Web Search Michael J. Cafarella December 5, 2007

Moving Parts
- Acquisition cycle
  - WebDB
  - Fetcher
- Index generation
  - Indexing
  - Link analysis (maybe)
- Serving results

Page 17: Intro to Web Search Michael J. Cafarella December 5, 2007

WebDB
- Contains info on all pages, links
  - Pages: URL, last download, # failures, link score, content hash, ref counting
  - Links: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19ms seek time x 200M new pages/mo = ~44 days of disk seeks!
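The seek-time figure can be checked with a few lines of arithmetic. This sketch assumes exactly one disk seek per new page, which is the simplest reading of the slide.

```python
SEEK_SECONDS = 0.019               # 19 ms per seek, from the slide
NEW_PAGES_PER_MONTH = 200_000_000  # 200M new pages per month, from the slide
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = SEEK_SECONDS * NEW_PAGES_PER_MONTH
days = total_seconds / SECONDS_PER_DAY
print(f"{days:.1f} days of seeking per month")
```

A design that seeks once per page would spend more than a month of disk time on each month of new pages, which is why the WebDB is updated with sequential passes instead.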

Page 18: Intro to Web Search Michael J. Cafarella December 5, 2007

Fetcher
- Fetcher is very stupid. Not a “crawler”
- Divide “to-fetch list” into k pieces, one for each fetcher machine
- URLs for one domain go to same list, otherwise random
  - “Politeness” w/o inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS, robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits
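One way to get "URLs for one domain go to same list, otherwise random" is to hash the host name. The sketch below is an assumption about how such a split could work, not Nutch's actual code: it uses the URL's host as a stand-in for the domain, and the names `fetchlist_for` and `partition` are illustrative.

```python
import hashlib
from urllib.parse import urlsplit

def fetchlist_for(url, k):
    """Pick one of k fetchlists from a stable hash of the URL's host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k

def partition(urls, k):
    """Split the to-fetch list into k pieces, one per fetcher machine."""
    lists = [[] for _ in range(k)]
    for url in urls:
        lists[fetchlist_for(url, k)].append(url)
    return lists

urls = [
    "http://www.cnn.com/index.html",
    "http://www.cnn.com/world/",
    "http://www.yahoo.com/index.html",
]
for i, piece in enumerate(partition(urls, 3)):
    print(i, piece)
```

Because every URL for a host lands on one fetcher, that fetcher alone can rate-limit requests to the host and cache its DNS and robots.txt lookups, with no coordination between machines.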

Page 19: Intro to Web Search Michael J. Cafarella December 5, 2007

WebDB/Fetcher Updates

1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

[Figure: a stream of WebDB records and a stream of fetcher edits are read in parallel to emit the new database.]

WebDB records shown:
  URL: http://www.flickr.com/index.html, LastUpdated: Never, ContentHash: None
  URL: http://www.cnn.com/index.html, LastUpdated: Never, ContentHash: None
  URL: http://www.yahoo.com/index.html, LastUpdated: 4/07/05, ContentHash: MD5_toewkekqmekkalekaa
  URL: http://www.about.com/index.html, LastUpdated: 3/22/05, ContentHash: MD5_sdflkjweroiwelksd

Fetcher edits:
  Edit: DOWNLOAD_CONTENT, URL: http://www.cnn.com/index.html, ContentHash: MD5_balboglerropewolefbag
  Edit: DOWNLOAD_CONTENT, URL: http://www.yahoo.com/index.html, ContentHash: MD5_toewkekqmekkalekaa
  Edit: NEW_LINK, URL: http://www.flickr.com/index.html, ContentHash: None

New database records shown:
  URL: http://www.cnn.com/index.html, LastUpdated: Today!, ContentHash: MD5_balboglerropewolefbag
  URL: http://www.yahoo.com/index.html, LastUpdated: Today!, ContentHash: MD5_toewkekqmekkalekaa
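The update procedure on this slide is a sort-merge: the database and the edits are both ordered by URL, so one sequential pass over the two streams produces the new database without random seeks. The sketch below is an in-memory illustration under stated assumptions: the tuple layouts and the name `apply_edits` are hypothetical, and NEW_LINK is treated as "make sure this URL has a record", which is a guess at its meaning.

```python
import heapq
from itertools import groupby

def apply_edits(db, edits, today):
    """db: (url, last_updated, content_hash) tuples sorted by url.
    edits: (url, edit_type, content_hash) tuples in any order."""
    # Tag 0 sorts an existing record ahead of the edits (tag 1) for the same URL.
    records = ((url, 0, (last, chash)) for url, last, chash in db)
    changes = ((url, 1, (kind, chash))
               for url, kind, chash in sorted(edits, key=lambda e: e[0]))
    merged = heapq.merge(records, changes, key=lambda t: t[:2])
    new_db = []
    for url, group in groupby(merged, key=lambda t: t[0]):
        last, chash = "Never", None  # defaults for a newly discovered URL
        for _, tag, payload in group:
            if tag == 0:
                last, chash = payload
            elif payload[0] == "DOWNLOAD_CONTENT":
                last, chash = today, payload[1]
            # NEW_LINK changes nothing; it only guarantees a record exists.
        new_db.append((url, last, chash))
    return new_db

db = [
    ("http://www.about.com/index.html", "3/22/05", "MD5_sdflkjweroiwelksd"),
    ("http://www.cnn.com/index.html", "Never", None),
    ("http://www.flickr.com/index.html", "Never", None),
    ("http://www.yahoo.com/index.html", "4/07/05", "MD5_toewkekqmekkalekaa"),
]
edits = [
    ("http://www.cnn.com/index.html", "DOWNLOAD_CONTENT", "MD5_balboglerropewolefbag"),
    ("http://www.yahoo.com/index.html", "DOWNLOAD_CONTENT", "MD5_toewkekqmekkalekaa"),
    ("http://www.flickr.com/index.html", "NEW_LINK", None),
]
new_db = apply_edits(db, edits, today="Today!")
for row in new_db:
    print(row)
```

At scale the sort in step 2 would be an external sort on disk and both inputs would be read as files, but the merge logic keeps the same shape.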

Page 20: Intro to Web Search Michael J. Cafarella December 5, 2007

Indexing
- Iterate through all k page sets in parallel, constructing inverted index
- Creates a “searchable document” like we saw earlier:
  - URL text
  - Content text
  - Incoming anchor text
- Inverted index provided by the Lucene open source project

Page 21: Intro to Web Search Michael J. Cafarella December 5, 2007

Administering Nutch
- Admin costs are critical
  - It’s a hassle when you have 25 machines
  - Google has maybe >100k
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Page 22: Intro to Web Search Michael J. Cafarella December 5, 2007

Administering Nutch (2)
- Admin sounds boring, but it’s not!
  - Really, I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job Control
  - Map/Reduce (Dean and Ghemawat)
- Result: Hadoop, a Nutch spinoff

Page 23: Intro to Web Search Michael J. Cafarella December 5, 2007

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - Single file can exist across many machines
  - Wholly automatic failure recovery

Page 24: Intro to Web Search Michael J. Cafarella December 5, 2007

NDFS (2)
- Data divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds metainfo
  - Filename → block list
  - Block → datanode-location
- Datanodes report in to namenode every few seconds

Page 25: Intro to Web Search Michael J. Cafarella December 5, 2007

NDFS File Read

[Figure: a client, the Namenode, and Datanodes 0 through 5.]

1. Client asks namenode for filename info
2. Namenode responds with blocklist, and location(s) for each block
3. Client fetches each block, in sequence, from a datanode

Example response for “crawl.txt”:
  (block-33 / datanodes 1, 4)
  (block-95 / datanodes 0, 2)
  (block-65 / datanodes 1, 4, 5)
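The read path can be sketched with two small classes. Everything here is a toy model: the class and method names are hypothetical and are not the NDFS or Hadoop API, the block contents are made up, and the client simply reads from the first listed replica. The block ids and locations are the ones shown for “crawl.txt”.

```python
class Namenode:
    """Holds only metadata: which blocks make up a file, and where they live."""
    def __init__(self):
        self.file_blocks = {}      # filename -> ordered list of block ids
        self.block_locations = {}  # block id -> list of datanode ids

    def lookup(self, filename):
        return [(b, self.block_locations[b]) for b in self.file_blocks[filename]]

class Datanode:
    """Holds and serves block contents."""
    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, filename):
    data = b""
    for block_id, locations in namenode.lookup(filename):     # steps 1 and 2
        data += datanodes[locations[0]].read_block(block_id)  # step 3
    return data

namenode = Namenode()
datanodes = [Datanode() for _ in range(6)]
layout = [(33, [1, 4], b"aaa"), (95, [0, 2], b"bbb"), (65, [1, 4, 5], b"ccc")]
namenode.file_blocks["crawl.txt"] = [block for block, _, _ in layout]
for block, locations, payload in layout:
    namenode.block_locations[block] = locations
    for dn in locations:
        datanodes[dn].blocks[block] = payload

print(read_file(namenode, datanodes, "crawl.txt"))  # b'aaabbbccc'
```

The namenode never touches file data, so it stays small and fast while the datanodes carry the bulk traffic.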

Page 26: Intro to Web Search Michael J. Cafarella December 5, 2007

NDFS Replication

[Figure: the Namenode and six datanodes, listed with the blocks each holds.]
  Datanode 0: (33, 95)
  Datanode 1: (46, 95)
  Datanode 2: (33, 104)
  Datanode 3: (21, 33, 46)
  Datanode 4: (90)
  Datanode 5: (21, 90, 104)

1. Always keep at least k copies of each blk
2. Imagine datanode 4 dies; blk 90 lost
3. Namenode loses heartbeat, decrements blk 90’s reference count. Asks datanode 5 to replicate blk 90 to datanode 0 (Blk 90 to dn 0)
4. Choosing replication target is tricky
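The re-replication decision can be sketched as a planning function over the namenode's view of which datanode holds which block. The name `plan_replication` and the target policy (least-loaded node, ties broken by lowest id) are assumptions for illustration; as the slide notes, choosing a target well is tricky, and this policy ignores things like rack placement and free space.

```python
def plan_replication(holdings, dead, k):
    """Return (block, source, target) copies needed after datanode `dead` fails.

    holdings: dict of datanode id -> set of block ids it stores.
    """
    live = {dn: set(blocks) for dn, blocks in holdings.items() if dn != dead}
    plan = []
    for block in sorted(holdings[dead]):
        holders = sorted(dn for dn, blocks in live.items() if block in blocks)
        while holders and len(holders) < k:
            candidates = [dn for dn in live if block not in live[dn]]
            if not candidates:
                break  # every live node already has a copy
            # Hypothetical policy: least-loaded node, ties broken by lowest id.
            target = min(candidates, key=lambda dn: (len(live[dn]), dn))
            plan.append((block, holders[0], target))
            live[target].add(block)
            holders.append(target)
    return plan

# Block placement from the slide.
holdings = {
    0: {33, 95}, 1: {46, 95}, 2: {33, 104},
    3: {21, 33, 46}, 4: {90}, 5: {21, 90, 104},
}
print(plan_replication(holdings, dead=4, k=2))  # [(90, 5, 0)]
```

With k=2 the plan copies block 90 from datanode 5 to datanode 0, the same move shown on the slide, though that agreement comes from the tie-break rule chosen here rather than from any known NDFS policy.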

Page 27: Intro to Web Search Michael J. Cafarella December 5, 2007

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- map(key, val) is run on each item in the set
  - Emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - Emits final output
- Many problems can be phrased this way

Page 28: Intro to Web Search Michael J. Cafarella December 5, 2007

Map/Reduce (2)
- Task: count words in docs
  - Input consists of (url, contents) pairs
  - map(key=url, val=contents):
    - For each word w in contents, emit (w, “1”)
  - reduce(key=word, values=uniq_counts):
    - Sum all “1”s in values list
    - Emit result “(word, sum)”
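The word-count job can be run end to end with a tiny sequential stand-in for the framework. The runner `run_mapreduce`, the function names, and the example URLs are assumptions for illustration; the map and reduce bodies follow the slide, including emitting the string “1” for each word.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, val in inputs:
        for out_key, out_val in map_fn(key, val):
            grouped[out_key].append(out_val)
    results = []
    for key in sorted(grouped):
        results.extend(reduce_fn(key, grouped[key]))
    return results

def wc_map(url, contents):
    for word in contents.split():
        yield word, "1"

def wc_reduce(word, counts):
    yield word, sum(int(c) for c in counts)

pages = [("http://a.example/", "such as such"), ("http://b.example/", "as words")]
print(run_mapreduce(pages, wc_map, wc_reduce))
# [('as', 2), ('such', 2), ('words', 1)]
```

The job author writes only `wc_map` and `wc_reduce`; grouping values by key is the framework's responsibility.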

Page 29: Intro to Web Search Michael J. Cafarella December 5, 2007

Map/Reduce (3)
- Task: grep
  - Input consists of (url+offset, single line)
  - map(key=url+offset, val=line):
    - If line matches regexp, emit (line, “1”)
  - reduce(key=line, values=uniq_counts):
    - Don’t do anything; just emit line
- We can also do graph inversion, link analysis, WebDB updates, etc.
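The grep job looks like this as a runnable sketch. The regexp, the input lines, the key format, and the small sequential `run` helper are all hypothetical; the map and reduce bodies follow the slide.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"nutch")  # hypothetical regexp to search for

def grep_map(key, line):
    if PATTERN.search(line):
        yield line, "1"

def grep_reduce(line, values):
    yield line  # identity reduce: ignore the values, just emit the line

def run(inputs):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, line in inputs:
        for out_key, out_val in grep_map(key, line):
            grouped[out_key].append(out_val)
    return [out for k in sorted(grouped) for out in grep_reduce(k, grouped[k])]

lines = [
    ("http://a.example/+0", "nutch is a search engine"),
    ("http://a.example/+25", "hadoop is the backend"),
    ("http://b.example/+0", "nutch is a search engine"),
]
print(run(lines))  # ['nutch is a search engine']
```

Because the emitted key is the line itself, identical matching lines from different pages collapse into one output line; emitting the url+offset as the key instead would keep every occurrence.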

Page 30: Intro to Web Search Michael J. Cafarella December 5, 2007

Map/Reduce (4) How is this distributed?

1. Partition input key/value pairs into chunks, run map() tasks in parallel

2. After all map()s are complete, consolidate all emitted values for each unique emitted key

3. Now partition space of output map keys, and run reduce() in parallel

If map() or reduce() fails, reexecute!
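The three distribution steps and the re-execution rule can be sketched in a single process. This is only an illustration under assumptions: it runs the "parallel" tasks one after another, the names `partition_of`, `run_with_retry`, `map_chunk`, and `reduce_partition` are hypothetical, and the CRC32-based partitioner is just one stable hash that would work.

```python
import zlib
from collections import defaultdict

def partition_of(key, num_reducers):
    """Stable hash so every map task sends a given key to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_with_retry(task, *args, attempts=3):
    """Re-execute a failed task; safe because tasks have no side effects."""
    for attempt in range(attempts):
        try:
            return task(*args)
        except Exception:
            if attempt == attempts - 1:
                raise

def map_chunk(chunk):
    return [(word, 1) for _, text in chunk for word in text.split()]

def reduce_partition(pairs):
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

def word_count(chunks, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for chunk in chunks:                                  # step 1: one map task per chunk
        for word, n in run_with_retry(map_chunk, chunk):
            # step 2: route each emitted pair by its key
            partitions[partition_of(word, num_reducers)].append((word, n))
    result = {}
    for pairs in partitions:                              # step 3: one reduce task per partition
        result.update(run_with_retry(reduce_partition, pairs))
    return result

chunks = [[("u1", "a b a")], [("u2", "b c")]]
print(word_count(chunks))
```

Re-execution is only this simple because map() and reduce() are deterministic functions of their input, so running one twice yields the same output.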

Page 31: Intro to Web Search Michael J. Cafarella December 5, 2007

Map/Reduce Job Processing

[Figure: a client submits a “grep” job to the JobTracker, which coordinates TaskTrackers 0 through 5.]

1. Client submits “grep” job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to tasktrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks (in this case 6). Assigns work.
5. reduce() output may go to NDFS

Page 32: Intro to Web Search Michael J. Cafarella December 5, 2007

Conclusion
- http://www.nutch.org/
  - Partial documentation
  - Source code
  - Developer discussion board
- “Lucene in Action” by Hatcher, Gospodnetic
- http://www.hadoop.org/
- Or, take 490H