CS 331 - Data Mining
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Lawrence Page
Presented By: Paolo Lim
April 10, 2007





AKA: The Original Google Paper

Larry Page and Sergey Brin


Presentation Outline

Design goals of the Google search engine
Link analysis and other features
System architecture and major structures
Crawling, indexing, and searching the web
Performance and results
Conclusions
Final exam questions


Linear Algebra Background

PageRank involves knowledge of:
  Matrix addition/multiplication
  Eigenvectors and eigenvalues
  Power iteration
  Dot product

Not discussed in detail in this presentation. For reference:

http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf

http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf


Google Design Goals

Scaling with the web’s growth
Improved search quality
  The number of documents is increasing rapidly, but users’ ability to look at documents lags
  Lots of “junk” results, little relevance
Academic search engine research
  Development and understanding in the academic realm
  A system that a reasonable number of people can actually use
  Support novel research activities on large-scale web data by other researchers and students


Link Analysis Basics

PageRank algorithm
  A top-10 IEEE ICDM data mining algorithm
  Large basis for the ranking system (discussed later)
  Tries to incorporate ideas from the academic community (publishing and citations)
Anchor text analysis
  <a href="http://www.com"> ANCHOR TEXT </a>


Intuition: Why Links, Anyway?

Links represent citations
Quantity of links to a website makes the website more popular
Quality of links to a website also helps in computing rank
Link structure was largely unused before Larry Page proposed it to his thesis advisor


Naïve PageRank

Each link’s vote is proportional to the importance of its source page
If page P with importance I has N outlinks, then each link gets I / N votes
Simple recursive formulation:
  PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
  PR(X): PageRank of page X
  C(X): number of links going out of page X
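A minimal sketch of this update rule in Python; the three-page link graph and the dict layout are illustrative, not from the paper:

```python
# Link graph as a dict: page -> list of pages it links to (hypothetical data).
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def pagerank_step(pr, links):
    """Apply PR(A) = sum over inlinks p of PR(p) / C(p) once."""
    new_pr = {page: 0.0 for page in pr}
    for page, outlinks in links.items():
        share = pr[page] / len(outlinks)   # each outlink carries PR(page)/C(page)
        for target in outlinks:
            new_pr[target] += share
    return new_pr

pr = {p: 1 / 3 for p in links}   # start from a uniform distribution
pr = pagerank_step(pr, links)
```

Iterating this step repeatedly is exactly the power iteration discussed on later slides.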


Naïve PageRank Model (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

The web in 1839: three pages, Yahoo (y), Amazon (a), and M’soft (m). Yahoo links to itself and Amazon; Amazon links to Yahoo and M’soft; M’soft links back to Amazon.

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2


Solving the flow equations

3 equations, 3 unknowns, no constants
  No unique solution; all solutions are equivalent up to a scale factor
An additional constraint forces uniqueness:
  y + a + m = 1
  y = 2/5, a = 2/5, m = 1/5
Gaussian elimination works for small examples, but we need a better method for large graphs
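For a system this small, the constrained equations can be solved directly; a sketch with NumPy, replacing one redundant flow equation with the normalization constraint:

```python
import numpy as np

# Flow equations y = y/2 + a/2, a = y/2 + m, m = a/2 are rank-deficient;
# adding y + a + m = 1 pins down a unique solution.
A = np.array([
    [-1/2,  1/2,  0.0],   # y/2 + a/2 - y = 0
    [ 1/2, -1.0,  1.0],   # y/2 + m   - a = 0
    [ 1.0,  1.0,  1.0],   # y + a + m = 1 (replaces the redundant third row)
])
b = np.array([0.0, 0.0, 1.0])
y, a, m = np.linalg.solve(A, b)   # expected: 2/5, 2/5, 1/5
```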


Matrix formulation

Matrix M has one row and one column for each web page

Suppose page j has n outlinks
  If j → i, then Mij = 1/n; else Mij = 0
M is a column-stochastic matrix: columns sum to 1
Suppose r is a vector with one entry per web page
  ri is the importance score of page i
  Call it the rank vector
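A short sketch of constructing M for the three-page example, assuming the link graph is given as a dict of outlinks (an illustrative layout, not the paper’s data structures):

```python
import numpy as np

# Build the column-stochastic matrix M for the tiny Yahoo/Amazon/M'soft web:
# column j holds 1/n in the rows of j's n outlink targets.
pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # page -> outlinks

N = len(pages)
idx = {p: k for k, p in enumerate(pages)}
M = np.zeros((N, N))
for j, outlinks in links.items():
    for i in outlinks:
        M[idx[i], idx[j]] = 1.0 / len(outlinks)

assert np.allclose(M.sum(axis=0), 1.0)  # columns sum to 1
```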


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

Suppose page j links to 3 pages, including i. Then column j of M has 1/3 in row i (and in the rows of j’s two other targets), so in the product r = Mr, page j contributes rj/3 to ri.


Eigenvector formulation

The flow equations can be written
  r = Mr
So the rank vector is an eigenvector of the stochastic web matrix
In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1


Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    1
  m     0  1/2    0

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

In matrix form, r = Mr:
  [y]   [1/2  1/2   0] [y]
  [a] = [1/2   0    1] [a]
  [m]   [ 0   1/2   0] [m]


Power Iteration

Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r0 = [1, …, 1]^T
Iterate: rk+1 = M rk
Stop when |rk+1 - rk|1 < ε
  |x|1 = Σ(1≤i≤N) |xi| is the L1 norm
  Can use any other vector norm, e.g., Euclidean
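The scheme above can be sketched as follows; the tolerance and iteration cap are illustrative choices, not values from the paper:

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Iterate r <- M r until the L1 change falls below eps."""
    N = M.shape[0]
    r = np.ones(N)                        # r0 = [1, ..., 1]^T
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:  # L1 norm of the change
            return r_next
        r = r_next
    return r

# Column-stochastic matrix from the Yahoo/Amazon/M'soft example (y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iterate(M)   # converges toward [6/5, 6/5, 3/5]
```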


Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    1
  m     0  1/2    0

  [y]   [1]   [ 1 ]   [5/4]   [ 9/8]         [6/5]
  [a] = [1] , [3/2] , [ 1 ] , [11/8] , … ,   [6/5]
  [m]   [1]   [1/2]   [3/4]   [ 1/2]         [3/5]


Random Surfer

Imagine a random web surfer
  At any time t, the surfer is on some page P
  At time t+1, the surfer follows an outlink from P uniformly at random
  Ends up on some page Q linked from P
  Process repeats indefinitely
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  p(t) is a probability distribution over pages


The stationary distribution

Where is the surfer at time t+1?
  Follows a link uniformly at random: p(t+1) = M p(t)
Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
  Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
  So it is a stationary distribution for the random surfer


Spider traps

A group of pages is a spider trap if there are no links from within the group to outside the group
  The random surfer gets trapped
Spider traps violate the conditions needed for the random walk theorem


Microsoft becomes a spider trap (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

M’soft now links only to itself:

        y    a    m
  y   1/2  1/2    0
  a   1/2    0    0
  m     0  1/2    1

  [y]   [1]   [ 1 ]   [3/4]   [5/8]         [0]
  [a] = [1] , [1/2] , [1/2] , [3/8] , … ,   [0]
  [m]   [1]   [3/2]   [7/4]   [ 2 ]         [3]


Random teleports

The Google solution for spider traps
At each time step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1-β, jump to some page uniformly at random
  Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a few time steps


Matrix formulation

Suppose there are N pages
Consider a page j with set of outlinks O(j)
  We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise
The random teleport is equivalent to:
  adding a teleport link from j to every other page with probability (1-β)/N
  reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly


PageRank

Construct the N×N matrix A as follows:
  Aij = β Mij + (1-β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
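A sketch of building A and iterating to the stationary distribution, using the spider-trap example and β = 0.8; the starting vector and iteration count are illustrative:

```python
import numpy as np

beta = 0.8                           # follow-a-link probability
M = np.array([[0.5, 0.5, 0.0],       # spider-trap web: columns are y, a, m,
              [0.5, 0.0, 0.0],       # and M'soft (m) links only to itself
              [0.0, 0.5, 1.0]])
N = M.shape[0]

A = beta * M + (1 - beta) / N        # A_ij = beta*M_ij + (1-beta)/N
assert np.allclose(A.sum(axis=0), 1.0)   # A is column-stochastic

r = np.ones(N) / N                   # start from the uniform distribution
for _ in range(200):
    r = A @ r                        # converges to the principal eigenvector
```

The stationary vector has the 7 : 5 : 21 ratio shown on the next slide, so even with the trap, M’soft no longer absorbs all the score.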


Previous example with β = 0.8 (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

            [1/2  1/2   0]         [1/3  1/3  1/3]     y [7/15  7/15   1/15]
  A = 0.8 * [1/2   0    0] + 0.2 * [1/3  1/3  1/3]  =  a [7/15  1/15   1/15]
            [ 0   1/2   1]         [1/3  1/3  1/3]     m [1/15  7/15  13/15]

  [y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
  [a] = [1] , [0.60] , [0.60] , [0.536] , … ,   [ 5/11]
  [m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]


Dead ends

Pages with no outlinks are “dead ends” for the random surfer: nowhere to go on the next step


Microsoft becomes a dead end (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

M’soft now has no outlinks, so its column of M is all zeros:

            [1/2  1/2   0]         [1/3  1/3  1/3]     y [7/15  7/15  1/15]
  A = 0.8 * [1/2   0    0] + 0.2 * [1/3  1/3  1/3]  =  a [7/15  1/15  1/15]
            [ 0   1/2   0]         [1/3  1/3  1/3]     m [1/15  7/15  1/15]

Non-stochastic! The m column sums to only 1/5, so score leaks away:

  [y]   [1]   [1.0]   [0.787]   [0.648]         [0]
  [a] = [1] , [0.6] , [0.547] , [0.430] , … ,   [0]
  [m]   [1]   [0.6]   [0.387]   [0.333]         [0]


Dealing with dead-ends

Teleport
  Follow random teleport links with probability 1.0 from dead ends
  Adjust the matrix accordingly
Prune and propagate
  Preprocess the graph to eliminate dead ends (might require multiple passes)
  Compute PageRank on the reduced graph
  Approximate values for dead ends by propagating values from the reduced graph
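The teleport fix can be sketched as a simple matrix patch; representing dead ends as all-zero columns is an assumption of this sketch:

```python
import numpy as np

def patch_dead_ends(M):
    """Replace each all-zero column of M (a dead-end page) with uniform 1/N,
    so the surfer teleports anywhere with probability 1.0 from a dead end
    and the matrix becomes column-stochastic again."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0        # columns summing to 0 are dead ends
    M[:, dead] = 1.0 / N
    return M

M = np.array([[0.5, 0.5, 0.0],       # m is a dead end: its column is all 0
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
M = patch_dead_ends(M)
```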


Anchor Text

Can be a more accurate description of the target site than the target site’s own text
Can point at non-HTTP or non-text resources:
  Images
  Videos
  Databases
Makes it possible for non-crawled pages to be returned in results


Other Features

Hit list: the list of occurrences of a particular word in a particular document
Location information and proximity
Keeps track of visual presentation details:
  Font size of words
  Capitalization
  Bold/italic/underlined/etc.
Full raw HTML of all pages is available in the repository


Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)

Implemented in C and C++ on Solaris and Linux


Google Architecture(from http://www.ics.uci.edu/~scott/google.htm)

Keeps track of URLs that have been and need to be crawled (URL server).

Compresses and stores web pages (store server).

Multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once.

Uncompresses and parses documents; stores link information in the anchors file (indexer).

Stores each link and the text surrounding it (anchors file).

Converts relative URLs into absolute URLs (URL resolver).

Contains the full HTML of every web page; each document is prefixed by docID, length, and URL (repository).


Google Architecture(from http://www.ics.uci.edu/~scott/google.htm)

Maps absolute URLs into docIDs stored in the doc index; stores anchor text in “barrels”; generates the database of links (pairs of docIDs).

Parses and distributes hit lists into “barrels” (indexer).

Creates the inverted index, whereby the document list containing docID and hit lists can be retrieved given a wordID (sorter).

In-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel which the wordID falls into (lexicon).

Partially sorted forward indexes, sorted by docID; each barrel stores hit lists for a given range of wordIDs.

DocID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it just contains the URL (doc index).


Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)

The list of wordIDs produced by the sorter and the lexicon created by the indexer are used to create the new lexicon used by the searcher. The lexicon stores ~14 million words.

The new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks are used to answer queries.

2 kinds of barrels: short barrels contain hit lists that include title or anchor hits; long barrels hold all hit lists.


Google Query Evaluation

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
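Steps 3-4 and 7 amount to a lockstep scan over sorted doclists; a toy sketch using plain Python lists in place of the paper’s barrel format (the doclist layout here is invented for illustration):

```python
def intersect_doclists(doclists):
    """Scan sorted doclists in lockstep; return docIDs present in all of them."""
    if not doclists:
        return []
    positions = [0] * len(doclists)
    matches = []
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        heads = [dl[p] for p, dl in zip(positions, doclists)]
        top = max(heads)
        if all(h == top for h in heads):
            matches.append(top)            # document matches every search term
            positions = [p + 1 for p in positions]
        else:
            # advance every doclist whose head is behind the largest head
            positions = [p + (dl[p] < top) for p, dl in zip(positions, doclists)]
    return matches

# One sorted doclist per query word; docID 3 and 9 appear in all three.
docs = intersect_doclists([[1, 3, 5, 9], [2, 3, 9, 10], [3, 4, 9]])
```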


Single Word Query Ranking

The hit list is retrieved for the single word
Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
Each hit type is assigned its own weight
The type-weights make up a vector of weights
The number of hits of each type is counted to form the count-weight vector
The dot product of the type-weight and count-weight vectors is used to compute the IR score
The IR score is combined with PageRank to compute the final rank
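A toy version of the IR score as the dot product described above. The hit types and weight values are made up for illustration, and the paper notes that real count-weights taper off rather than growing linearly with counts:

```python
# Hypothetical type-weights: one weight per hit type (values invented).
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 4.0, "plain": 1.0}

def ir_score(hit_counts):
    """Dot product of the type-weight vector and the count-weight vector.
    hit_counts maps hit type -> number of hits of that type in the document."""
    return sum(TYPE_WEIGHTS[t] * count for t, count in hit_counts.items())

# A document with one title hit and three plain-text hits for the query word.
score = ir_score({"title": 1, "plain": 3})
```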


Multi-word Query Ranking

Similar to single-word ranking, except now the proximity of words in a document must also be analyzed
Hits occurring closer together are weighted higher than those farther apart
Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”
Each type-and-proximity pair has a type-prox weight
Counts are converted into count-weights
The dot product of the count-weights and type-prox weights is taken to compute the IR score


Scalability

Cluster architecture combined with Moore’s Law makes for high scalability. At the time of writing:
  ~24 million documents indexed in one week
  ~518 million hyperlinks indexed
  Four crawlers collected 100 documents/sec


Key Optimization Techniques

Each crawler maintains its own DNS lookup cache
flex is used to generate a lexical analyzer with its own stack for parsing documents
Parallelization of the indexing phase
In-memory lexicon
Compression of the repository
Compact encoding of hit lists for space savings
The indexer is optimized so it is just faster than the crawlers, so that crawling is the bottleneck
The document index is updated in bulk
Critical data structures are placed on local disk
The overall architecture is designed to avoid disk seeks wherever possible


Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm)

At the time of publication, Google had the following statistical breakdown for storage requirements:
  [storage-breakdown table from the slide not reproduced]


Conclusions

Search is far from perfect; future directions include:
  Topic/domain-specific PageRank
  Machine translation in search
  Non-hypertext search
Business potential
  Brin and Page worth around $15 billion each… at 32 years old!
  If you have a better idea than how Google does search, please remember me when you’re hiring software engineers!


Possible Exam Questions

Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration. (Slides 14-16)

What are spider traps and dead ends, and how does Google deal with them? (Spider traps: slides 19-21; dead ends: slides 25-27)

Explain the difference between single-word and multi-word search query evaluation. (Slides 35-36)


References

Brin, S., Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Page, L., Brin, S., Motwani, R., Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web.

http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf

http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf

http://www.ics.uci.edu/~scott/google.htm


Thank you!