CS 331 - Data Mining
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Lawrence Page
Presented By: Paolo Lim
April 10, 2007





AKA: The Original Google Paper

Larry Page and Sergey Brin


Presentation Outline

Design goals of the Google search engine
Link analysis and other features
System architecture and major structures
Crawling, indexing, and searching the web
Performance and results
Conclusions
Final exam questions


Linear Algebra Background

PageRank involves knowledge of:
  Matrix addition/multiplication
  Eigenvectors and eigenvalues
  Power iteration
  Dot product

Not discussed in detail in this presentation. For reference:

http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf

http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf


Google Design Goals

Scaling with the web’s growth
Improved search quality
  The number of documents is increasing rapidly, but users’ ability to look at documents lags
  Lots of “junk” results, little relevance
Academic search engine research
  Development and understanding in the academic realm
  A system that a reasonable number of people can actually use
  Support novel research activities on large-scale web data by other researchers and students


Link Analysis Basics

PageRank algorithm
  A top-10 IEEE ICDM data mining algorithm
  Large basis for the ranking system (discussed later)
  Tries to incorporate ideas from the academic community (publishing and citations)
Anchor text analysis
  <a href="http://www.com"> ANCHOR TEXT </a>


Intuition: Why Links, Anyway?

Links represent citations
Quantity of links to a website makes the website more popular
Quality of links to a website also helps in computing rank
Link structure was largely unused before Larry Page proposed it to his thesis advisor


Naïve PageRank

Each link’s vote is proportional to the importance of its source page
If page P with importance I has N outlinks, then each link gets I / N votes
Simple recursive formulation:
  PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
  PR(X): PageRank of page X
  C(X): number of links going out of page X
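A minimal sketch of this update rule in Python; the three-page link graph and the dict layout are illustrative, not from the paper:

```python
# Link graph as a dict: page -> list of pages it links to (hypothetical data).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def pagerank_step(pr, links):
    """Apply PR(A) = sum over inlinks p of PR(p) / C(p) once."""
    new_pr = {page: 0.0 for page in pr}
    for page, outlinks in links.items():
        share = pr[page] / len(outlinks)   # each outlink carries PR(page)/C(page)
        for target in outlinks:
            new_pr[target] += share
    return new_pr

pr = {p: 1 / 3 for p in links}   # start from a uniform distribution
pr = pagerank_step(pr, links)
```

Iterating this step repeatedly is exactly the power iteration discussed on later slides.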


Naïve PageRank Model (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

The web in 1839: three pages, Yahoo (y), Amazon (a), and M’soft (m). Yahoo links to itself and Amazon; Amazon links to Yahoo and M’soft; M’soft links back to Amazon.

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2


Solving the flow equations

3 equations, 3 unknowns, no constants
  No unique solution; all solutions are equivalent up to a scale factor
An additional constraint forces uniqueness:
  y + a + m = 1
  y = 2/5, a = 2/5, m = 1/5
Gaussian elimination works for small examples, but we need a better method for large graphs
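For a system this small, the constrained equations can be solved directly; a sketch with NumPy, replacing one redundant flow equation with the normalization constraint:

```python
import numpy as np

# Flow equations y = y/2 + a/2, a = y/2 + m, m = a/2 are rank-deficient;
# adding y + a + m = 1 pins down a unique solution.
A = np.array([
    [-1/2,  1/2,  0.0],   # y/2 + a/2 - y = 0
    [ 1/2, -1.0,  1.0],   # y/2 + m   - a = 0
    [ 1.0,  1.0,  1.0],   # y + a + m = 1 (replaces the redundant third row)
])
b = np.array([0.0, 0.0, 1.0])
y, a, m = np.linalg.solve(A, b)   # expected: 2/5, 2/5, 1/5
```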


Matrix formulation

Matrix M has one row and one column for each web page

Suppose page j has n outlinks
  If j → i, then Mij = 1/n; else Mij = 0
M is a column-stochastic matrix: columns sum to 1
Suppose r is a vector with one entry per web page
  ri is the importance score of page i
  Call it the rank vector
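A short sketch of constructing M for the three-page example, assuming the link graph is given as a dict of outlinks (an illustrative layout, not the paper’s data structures):

```python
import numpy as np

# Build the column-stochastic matrix M for the tiny Yahoo/Amazon/M'soft web:
# column j holds 1/n in the rows of j's n outlink targets.
pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # page -> outlinks

N = len(pages)
idx = {p: k for k, p in enumerate(pages)}
M = np.zeros((N, N))
for j, outlinks in links.items():
    for i in outlinks:
        M[idx[i], idx[j]] = 1.0 / len(outlinks)

assert np.allclose(M.sum(axis=0), 1.0)  # columns sum to 1
```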


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Suppose page j links to 3 pages, including i. Then column j of M has 1/3 in row i (and in the rows of j’s two other targets), so in the product r = Mr, page j contributes rj/3 to ri.


Eigenvector formulation

The flow equations can be written
  r = Mr
So the rank vector is an eigenvector of the stochastic web matrix
In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    1
  m     0  1/2    0

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form, r = Mr:
  [y]   [1/2  1/2   0] [y]
  [a] = [1/2   0    1] [a]
  [m]   [ 0   1/2   0] [m]


Power Iteration

Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r0 = [1, …, 1]^T
Iterate: rk+1 = M rk
Stop when |rk+1 - rk|1 < ε
  |x|1 = Σ(1≤i≤N) |xi| is the L1 norm
  Can use any other vector norm, e.g., Euclidean
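The scheme above can be sketched as follows; the tolerance and iteration cap are illustrative choices, not values from the paper:

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Iterate r <- M r until the L1 change falls below eps."""
    N = M.shape[0]
    r = np.ones(N)                        # r0 = [1, ..., 1]^T
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm of the change
            return r_next
        r = r_next
    return r

# Column-stochastic matrix from the Yahoo/Amazon/M'soft example (y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iterate(M)   # converges toward [6/5, 6/5, 3/5]
```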


Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    1
  m     0  1/2    0

  [y]   [1]   [ 1 ]   [5/4]   [ 9/8]         [6/5]
  [a] = [1] , [3/2] , [ 1 ] , [11/8] , … ,   [6/5]
  [m]   [1]   [1/2]   [3/4]   [ 1/2]         [3/5]


Random Surfer

Imagine a random web surfer
  At any time t, the surfer is on some page P
  At time t+1, the surfer follows an outlink from P uniformly at random
  Ends up on some page Q linked from P
  Process repeats indefinitely
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  p(t) is a probability distribution over pages


The stationary distribution

Where is the surfer at time t+1?
  Follows a link uniformly at random: p(t+1) = M p(t)
Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
  Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
  So it is a stationary distribution for the random surfer


Spider traps

A group of pages is a spider trap if there are no links from within the group to outside the group
  The random surfer gets trapped
Spider traps violate the conditions needed for the random walk theorem


Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

M’soft now links only to itself:

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    0
  m     0  1/2    1

  [y]   [1]   [ 1 ]   [3/4]   [5/8]         [0]
  [a] = [1] , [1/2] , [1/2] , [3/8] , … ,   [0]
  [m]   [1]   [3/2]   [7/4]   [ 2 ]         [3]


Random teleports

The Google solution for spider traps
At each time step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1-β, jump to some page uniformly at random
  Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a few time steps


Matrix formulation

Suppose there are N pages
Consider a page j with set of outlinks O(j)
  We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise
The random teleport is equivalent to:
  adding a teleport link from j to every other page with probability (1-β)/N
  reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly


PageRank

Construct the N×N matrix A as follows:
  Aij = β Mij + (1-β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
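A sketch of building A and iterating to the stationary distribution, using the spider-trap example and β = 0.8; the starting vector and iteration count are illustrative:

```python
import numpy as np

beta = 0.8                           # follow-a-link probability
M = np.array([[0.5, 0.5, 0.0],       # spider-trap web: columns are y, a, m,
              [0.5, 0.0, 0.0],       # and M'soft (m) links only to itself
              [0.0, 0.5, 1.0]])
N = M.shape[0]

A = beta * M + (1 - beta) / N        # A_ij = beta*M_ij + (1-beta)/N
assert np.allclose(A.sum(axis=0), 1.0)   # A is column-stochastic

r = np.ones(N) / N                   # start from the uniform distribution
for _ in range(200):
    r = A @ r                        # converges to the principal eigenvector
```

The stationary vector has the 7 : 5 : 21 ratio shown on the next slide, so even with the trap, M’soft no longer absorbs all the score.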


Previous example with β = 0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

            [1/2  1/2   0]         [1/3  1/3  1/3]     y [7/15  7/15   1/15]
  A = 0.8 * [1/2   0    0] + 0.2 * [1/3  1/3  1/3]  =  a [7/15  1/15   1/15]
            [ 0   1/2   1]         [1/3  1/3  1/3]     m [1/15  7/15  13/15]

  [y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
  [a] = [1] , [0.60] , [0.60] , [0.536] , … ,   [ 5/11]
  [m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]


Dead ends

Pages with no outlinks are “dead ends” for the random surfer: nowhere to go on the next step


Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

M’soft now has no outlinks, so its column of M is all zeros:

            [1/2  1/2   0]         [1/3  1/3  1/3]     y [7/15  7/15  1/15]
  A = 0.8 * [1/2   0    0] + 0.2 * [1/3  1/3  1/3]  =  a [7/15  1/15  1/15]
            [ 0   1/2   0]         [1/3  1/3  1/3]     m [1/15  7/15  1/15]

Non-stochastic! The m column sums to only 1/5, so score leaks away:

  [y]   [1]   [1.0]   [0.787]   [0.648]         [0]
  [a] = [1] , [0.6] , [0.547] , [0.430] , … ,   [0]
  [m]   [1]   [0.6]   [0.387]   [0.333]         [0]


Dealing with dead-ends

Teleport
  Follow random teleport links with probability 1.0 from dead ends
  Adjust the matrix accordingly
Prune and propagate
  Preprocess the graph to eliminate dead ends (might require multiple passes)
  Compute PageRank on the reduced graph
  Approximate values for dead ends by propagating values from the reduced graph
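The teleport fix can be sketched as a simple matrix patch; representing dead ends as all-zero columns is an assumption of this sketch:

```python
import numpy as np

def patch_dead_ends(M):
    """Replace each all-zero column of M (a dead-end page) with uniform 1/N,
    so the surfer teleports anywhere with probability 1.0 from a dead end
    and the matrix becomes column-stochastic again."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0        # columns summing to 0 are dead ends
    M[:, dead] = 1.0 / N
    return M

M = np.array([[0.5, 0.5, 0.0],       # m is a dead end: its column is all 0
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
M = patch_dead_ends(M)
```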


Anchor Text

Can be a more accurate description of the target site than the target site’s own text
Can point at non-HTTP or non-text resources:
  Images
  Videos
  Databases
Makes it possible for non-crawled pages to be returned in results


Other Features

Hit list: the list of occurrences of a particular word in a particular document
Location information and proximity
Keeps track of visual presentation details:
  Font size of words
  Capitalization
  Bold/italic/underlined/etc.
Full raw HTML of all pages is available in the repository


Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)

Implemented in C and C++ on Solaris and Linux


Google Architecture(from http://www.ics.uci.edu/~scott/google.htm)

Keeps track of URLs that have been and need to be crawled (URL server).

Compresses and stores web pages (store server).

Multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once.

Uncompresses and parses documents; stores link information in the anchors file (indexer).

Stores each link and the text surrounding it (anchors file).

Converts relative URLs into absolute URLs (URL resolver).

Contains the full HTML of every web page; each document is prefixed by docID, length, and URL (repository).


Google Architecture(from http://www.ics.uci.edu/~scott/google.htm)

Maps absolute URLs into docIDs stored in the doc index; stores anchor text in “barrels”; generates the database of links (pairs of docIDs).

Parses and distributes hit lists into “barrels” (indexer).

Creates the inverted index, whereby the document list containing docID and hit lists can be retrieved given a wordID (sorter).

In-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel which the wordID falls into (lexicon).

Partially sorted forward indexes, sorted by docID; each barrel stores hit lists for a given range of wordIDs.

DocID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it just contains the URL (doc index).


Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)

The list of wordIDs produced by the sorter and the lexicon created by the indexer are used to create the new lexicon used by the searcher. The lexicon stores ~14 million words.

The new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks are used to answer queries.

2 kinds of barrels: short barrels contain hit lists that include title or anchor hits; long barrels hold all hit lists.


Google Query Evaluation

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
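Steps 3-4 and 7 amount to a lockstep scan over sorted doclists; a toy sketch using plain Python lists in place of the paper’s barrel format (the doclist layout here is invented for illustration):

```python
def intersect_doclists(doclists):
    """Scan sorted doclists in lockstep; return docIDs present in all of them."""
    if not doclists:
        return []
    positions = [0] * len(doclists)
    matches = []
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        heads = [dl[p] for p, dl in zip(positions, doclists)]
        top = max(heads)
        if all(h == top for h in heads):
            matches.append(top)            # document matches every search term
            positions = [p + 1 for p in positions]
        else:
            # advance every doclist whose head is behind the largest head
            positions = [p + (dl[p] < top) for p, dl in zip(positions, doclists)]
    return matches

# One sorted doclist per query word; docID 3 and 9 appear in all three.
docs = intersect_doclists([[1, 3, 5, 9], [2, 3, 9, 10], [3, 4, 9]])
```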


Single Word Query Ranking

The hit list is retrieved for the single word
Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
Each hit type is assigned its own weight
The type-weights make up a vector of weights
The number of hits of each type is counted to form the count-weight vector
The dot product of the type-weight and count-weight vectors is used to compute the IR score
The IR score is combined with PageRank to compute the final rank
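A toy version of the IR score as the dot product described above. The hit types and weight values are made up for illustration, and the paper notes that real count-weights taper off rather than growing linearly with counts:

```python
# Hypothetical type-weights: one weight per hit type (values invented).
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 4.0, "plain": 1.0}

def ir_score(hit_counts):
    """Dot product of the type-weight vector and the count-weight vector.
    hit_counts maps hit type -> number of hits of that type in the document."""
    return sum(TYPE_WEIGHTS[t] * count for t, count in hit_counts.items())

# A document with one title hit and three plain-text hits for the query word.
score = ir_score({"title": 1, "plain": 3})
```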


Multi-word Query Ranking

Similar to single-word ranking, except now the proximity of words in a document must also be analyzed
Hits occurring closer together are weighted higher than those farther apart
Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”
Each type-and-proximity pair has a type-prox weight
Counts are converted into count-weights
The dot product of the count-weights and type-prox weights is taken to compute the IR score


Scalability

Cluster architecture combined with Moore’s Law makes for high scalability. At the time of writing:
  ~24 million documents indexed in one week
  ~518 million hyperlinks indexed
  Four crawlers collected 100 documents/sec


Key Optimization Techniques

Each crawler maintains its own DNS lookup cache
flex is used to generate a lexical analyzer with its own stack for parsing documents
Parallelization of the indexing phase
In-memory lexicon
Compression of the repository
Compact encoding of hit lists for space savings
The indexer is optimized so it is just faster than the crawlers, so that crawling is the bottleneck
The document index is updated in bulk
Critical data structures are placed on local disk
The overall architecture is designed to avoid disk seeks wherever possible


Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm)

At the time of publication, Google had the following statistical breakdown for storage requirements:
  [storage-breakdown table from the slide not reproduced]


Conclusions

Search is far from perfect; future directions include:
  Topic/domain-specific PageRank
  Machine translation in search
  Non-hypertext search
Business potential
  Brin and Page worth around $15 billion each… at 32 years old!
  If you have a better idea than how Google does search, please remember me when you’re hiring software engineers!


Possible Exam Questions

Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration. (Slides 14-16)

What are spider traps and dead ends, and how does Google deal with them? (Spider traps: slides 19-21; dead ends: slides 25-27)

Explain the difference between single-word and multi-word search query evaluation. (Slides 35-36)


References

Brin, S., Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Page, L., Brin, S., Motwani, R., Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web.

http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf

http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf

http://www.ics.uci.edu/~scott/google.htm


Thank you!