
Google, Web Crawling, and Distributed Synchronization

Zachary G. Ives, University of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

April 1, 2008


Administrivia

Homework due TONIGHT
Please email me a list of project partners by THURSDAY


Google Architecture [Brin/Page 98]

Focus was on scalability to the size of the Web
First to really exploit link analysis
Started as an academic project @ Stanford; became a startup
Est. 450K commodity servers in 2006!!!


Google Document Repository

“BigFile” distributed filesystem (GFS)
  Support for 2^64 bytes across multiple drives, filesystems
  Manages its own file descriptors, resources
Repository
  Contains every HTML page (this is the cached page entry), compressed in zlib (faster than bzip)
  Two indices: DocID → documents; URL checksum → DocID (see the sketch below)
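To make the two-index layout concrete, here is a minimal Java sketch (my own illustration, not Google's code; the class and method names are invented) of a repository keyed by DocID, with a URL-checksum index in front of it and zlib decompression on read via java.util.zip:

import java.util.HashMap;
import java.util.Map;
import java.util.zip.Inflater;

// Minimal sketch: a repository keyed by DocID holding zlib-compressed page
// bytes, plus a URL-checksum -> DocID index in front of it.
public class RepositorySketch {
    private final Map<Long, byte[]> docIdToCompressedPage = new HashMap<>();
    private final Map<Long, Long> urlChecksumToDocId = new HashMap<>();

    public void store(long docId, long urlChecksum, byte[] zlibCompressedHtml) {
        docIdToCompressedPage.put(docId, zlibCompressedHtml);
        urlChecksumToDocId.put(urlChecksum, docId);
    }

    // Look up the cached page for a URL checksum and inflate it with zlib.
    public byte[] cachedPageForChecksum(long urlChecksum) throws Exception {
        Long docId = urlChecksumToDocId.get(urlChecksum);
        if (docId == null) return null;
        byte[] compressed = docIdToCompressedPage.get(docId);
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && inflater.needsInput()) break;  // truncated input; stop
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }
}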


Lexicon

The list of searchable words
  (Presumably, today it’s used to suggest alternative words as well)
The “root” of the inverted index
As of 1998, 14 million “words”
Kept in memory (was 256 MB)
Two parts (see the sketch below):
  Hash table of pointers to words and the “barrels” (partitions) they fall into
  List of words (null-separated)
Much larger today – may still fit in memory??
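A minimal sketch of that two-part structure, assuming ASCII words and a toy barrel assignment; the class and field names are my own, and this is not Google's actual layout:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch: a packed, null-separated word list plus a hash table mapping each
// word to its offset in that list and the barrel (partition) it falls into.
public class LexiconSketch {
    static final class Entry {
        final int wordOffset;   // offset into the packed word list
        final int barrelId;     // which inverted-index barrel holds this word
        Entry(int wordOffset, int barrelId) { this.wordOffset = wordOffset; this.barrelId = barrelId; }
    }

    private final byte[] packedWords;             // e.g. "web\0crawl\0index\0..."
    private final Map<String, Entry> table = new HashMap<>();

    public LexiconSketch(String[] words) {
        StringBuilder packed = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            // Assumes ASCII, so char offsets equal byte offsets; barrel choice is a toy.
            table.put(words[i], new Entry(packed.length(), i % 4));
            packed.append(words[i]).append('\0');
        }
        packedWords = packed.toString().getBytes(StandardCharsets.UTF_8);
    }

    public Integer barrelFor(String word) {
        Entry e = table.get(word);
        return e == null ? null : e.barrelId;
    }

    // Read the null-terminated word stored at a given offset.
    public String wordAt(int offset) {
        int end = offset;
        while (end < packedWords.length && packedWords[end] != 0) end++;
        return new String(packedWords, offset, end - offset, StandardCharsets.UTF_8);
    }
}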


Indices – Inverted and “Forward”

Inverted index divided into “barrels” (partitions by range)
  Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
Forward index uses the same barrels
  Used to find multi-word queries with words in the same barrel
  Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
Two barrels: short (anchor and title); full (all text)
Note: barrels are now called “shards”

[Figure: lexicon (293 MB) of WordID/ndocs entries pointing into inverted barrels (41 GB) of DocID entries with hit lists; forward barrels (43 GB total) of DocID, WordID, nhits, and hit entries. Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm]


Hit Lists (Not Mafia-Related)

Used in inverted and forward indices
Goal was to minimize the size – the bulk of data is in hit entries
For the 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then):

  Plain:  cap: 1  font: 3  position: 12
vs. special-cased to:
  Fancy:  cap: 1  font: 7  type: 4  position: 8
  Anchor: cap: 1  font: 7  type: 4  hash: 4  pos: 4

(Field widths are in bits; a packing sketch follows.)
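Here is a small, illustrative Java sketch of packing a “plain” hit into 2 bytes using the field widths above (cap: 1 bit, font: 3 bits, position: 12 bits). The method names are mine and this is not Google's actual encoding; it just shows how the fields fit in 16 bits:

// Illustrative bit packing of a "plain" hit into one 16-bit value.
public class HitPacking {
    static short packPlainHit(boolean capitalized, int font, int position) {
        int cap = capitalized ? 1 : 0;
        int f = Math.min(font, 7);          // 3 bits
        int pos = Math.min(position, 4095); // 12 bits; larger positions saturate
        return (short) ((cap << 15) | (f << 12) | pos);
    }

    static int positionOf(short hit) { return hit & 0x0FFF; }
    static int fontOf(short hit)     { return (hit >> 12) & 0x7; }
    static boolean capOf(short hit)  { return ((hit >> 15) & 0x1) == 1; }

    public static void main(String[] args) {
        short h = packPlainHit(true, 5, 1234);
        System.out.println(capOf(h) + " " + fontOf(h) + " " + positionOf(h)); // true 5 1234
    }
}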


Google’s Distributed Crawler (in 1998)

Single URL Server – the coordinator
  A queue that farms out URLs to crawler nodes
  Implemented in Python!
Crawlers have 300 open connections apiece
  Each needs own DNS cache – DNS lookup is major bottleneck
  Based on asynchronous I/O (which is now supported in Java)
Many caveats in building a “friendly” crawler (we’ll talk about some momentarily)


Google’s Search Algorithm

1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K

(A rough sketch of the doclist scan appears below.)
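A rough Java sketch of steps 3–7, assuming each query word's doclist is a sorted list of DocIDs and omitting ranking; the class, method, and the early-exit parameter k are my own placeholders, not Google's data structures:

import java.util.*;

// Intersect per-word doclists (sorted by DocID) until k matches are found.
public class SearchSketch {
    static List<Long> conjunctiveMatch(List<List<Long>> doclists, int k) {
        List<Long> results = new ArrayList<>();
        if (doclists.isEmpty()) return results;
        int[] pos = new int[doclists.size()];            // cursor into each doclist
        outer:
        while (true) {
            if (pos[0] >= doclists.get(0).size()) break;
            long candidate = doclists.get(0).get(pos[0]);  // current DocID of first list
            for (int i = 1; i < doclists.size(); i++) {
                List<Long> dl = doclists.get(i);
                while (pos[i] < dl.size() && dl.get(pos[i]) < candidate) pos[i]++;
                if (pos[i] >= dl.size()) break outer;      // one list exhausted: done
                if (dl.get(pos[i]) > candidate) {          // candidate missing here: skip it
                    pos[0]++;
                    continue outer;
                }
            }
            results.add(candidate);                        // matched all terms; rank it here
            pos[0]++;
            if (results.size() >= k) break;                // "unless we have enough"
        }
        return results;                                    // would be sorted by rank, not DocID
    }
}

A real implementation would scan the short barrels first and fall back to the full barrels, as step 6 describes.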


Ranking in Google

Considers many types of information:
  Position, font size, capitalization
  Anchor text
  PageRank
  Count of occurrences (basically, TF) in a way that tapers off
  (Not clear if they do IDF?)
Multi-word queries consider proximity also
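Purely as an illustration of combining such signals (not Google's formula; every weight here is made up), a tapered TF term plus anchor, PageRank, and proximity contributions might look like:

// Illustrative only: a tapered term-frequency score plus weighted signals.
public class RankSketch {
    static double score(int termCount, boolean inAnchorText, double pageRank, double proximityBonus) {
        double taperedTf = Math.log(1 + termCount);  // tapers off as counts grow
        double anchor = inAnchorText ? 1.5 : 0.0;    // invented weight
        return taperedTf + anchor + 2.0 * pageRank + proximityBonus;
    }
}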


Google’s Resources

In 1998:
  24M web pages
  About 55GB data w/o repository
  About 110GB with repository
  Lexicon 293MB
  Worked quite well with low-end PC

By 2006, > 25 billion pages, 400M queries/day:
  Don’t attempt to include all barrels on every machine!
    e.g., 5+TB repository on special servers separate from index servers
  Many special-purpose indexing services (e.g., images)
  Much greater distribution of data (2008, ~450K PCs?), huge net BW
  Advertising needs to be tied in (100,000 advertisers claimed)


Web Crawling

You’ve already built a basic crawler
First, some politeness criteria
Then we’ll look at the Mercator distributed crawler as an example


Basic Crawler (Spider/Robot) Process

Start with some initial page P0
Collect all URLs from P0 and add to the crawler queue
  Consider <base href> tag, anchor links, optionally image links, CSS, DTDs, scripts
Considerations (a BFS loop sketch follows the list):
  What order to traverse (polite to do BFS – why?)
  How deep to traverse
  What to ignore (coverage)
  How to escape “spider traps” and avoid cycles
  How often to crawl
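A minimal sketch of the breadth-first loop, assuming placeholder fetch() and extractLinks() helpers you would supply; the depth limit is one simple (and incomplete) guard against spider traps:

import java.util.*;

// Minimal BFS crawl loop: FIFO frontier + visited set + depth limit.
public class BfsCrawlerSketch {
    static void crawl(String seedUrl, int maxDepth) {
        Deque<String[]> frontier = new ArrayDeque<>();   // entries are {url, depth}
        Set<String> seen = new HashSet<>();
        frontier.add(new String[]{seedUrl, "0"});
        seen.add(seedUrl);

        while (!frontier.isEmpty()) {
            String[] entry = frontier.poll();            // FIFO => breadth-first order
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            String html = fetch(url);                    // placeholder
            if (html == null || depth >= maxDepth) continue;
            for (String link : extractLinks(url, html)) {// placeholder
                if (seen.add(link)) frontier.add(new String[]{link, Integer.toString(depth + 1)});
            }
        }
    }

    static String fetch(String url) { return ""; }                                   // stub
    static List<String> extractLinks(String base, String html) { return List.of(); } // stub
}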


Essential Crawler Etiquette

Robot exclusion protocols
First, ignore pages with:
  <META NAME="ROBOTS" CONTENT="NOINDEX">
Second, look for robots.txt at root of web server
  See http://www.robotstxt.org/wc/robots.html
To exclude all robots from a server:
  User-agent: *
  Disallow: /
To exclude one robot from two directories (a minimal check appears below):
  User-agent: BobsCrawler
  Disallow: /news/
  Disallow: /tmp/
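A minimal robots.txt check, written for illustration only (it handles just User-agent and Disallow prefix rules, not the full spec; a production crawler should use a complete parser):

import java.util.*;

// Toy robots.txt check: collects Disallow prefixes from applicable groups.
public class RobotsTxtSketch {
    static boolean isAllowed(String robotsTxt, String userAgent, String path) {
        List<String> disallows = new ArrayList<>();
        boolean groupApplies = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.split("#", 2)[0].trim();   // strip comments
            if (line.isEmpty()) continue;
            String[] kv = line.split(":", 2);
            if (kv.length < 2) continue;
            String key = kv[0].trim().toLowerCase(), value = kv[1].trim();
            if (key.equals("user-agent")) {
                groupApplies = value.equals("*") || value.equalsIgnoreCase(userAgent);
            } else if (key.equals("disallow") && groupApplies && !value.isEmpty()) {
                disallows.add(value);
            }
        }
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: BobsCrawler\nDisallow: /news/\nDisallow: /tmp/\n";
        System.out.println(isAllowed(robots, "BobsCrawler", "/news/today.html")); // false
        System.out.println(isAllowed(robots, "BobsCrawler", "/index.html"));      // true
    }
}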


Mercator

A scheme for building a distributed web crawler
Expands a “URL frontier”
Avoids re-crawling same URLs
Also considers whether a document has been seen before
  Every document has signature/checksum info computed as it’s crawled


Mercator Web Crawler

1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
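A compact Java paraphrase of this loop (not Mercator's code), assuming stub fetch(), extractLinks(), and wanted() helpers; content fingerprints (MD5 here) catch duplicate documents reached via different URLs, while a seen-URL set catches repeated URLs:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

// Per-URL crawl step: fingerprint the content, then filter and enqueue links.
public class MercatorLoopSketch {
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> contentFingerprints = new HashSet<>();
    private final Set<String> seenUrls = new HashSet<>();

    void step() throws Exception {
        String url = frontier.poll();                       // 1. dequeue frontier URL
        if (url == null) return;
        byte[] doc = fetch(url);                            // 2. fetch document (stub)
        String fp = md5Hex(doc);                            // 3-4. fingerprint the content
        if (!contentFingerprints.add(fp)) return;           //      already seen this document
        for (String link : extractLinks(url, doc)) {        // 5. extract hyperlinks (stub)
            if (!wanted(link)) continue;                    // 6. filter unwanted links
            if (seenUrls.add(link)) frontier.add(link);     // 7-8. enqueue only new URLs
        }
    }

    static String md5Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        return new BigInteger(1, digest).toString(16);
    }

    byte[] fetch(String url) { return new byte[0]; }                         // stub
    List<String> extractLinks(String base, byte[] doc) { return List.of(); } // stub
    boolean wanted(String link) { return link.startsWith("http"); }
}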


Mercator’s Polite Frontier Queues

Tries to go beyond the breadth-first approach – want to have only one crawler thread per server
Distributed URL frontier queue:
  One subqueue per worker thread
  The worker thread is determined by hashing the hostname of the URL
  Thus, only one outstanding request per web server
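A sketch of that hostname-to-subqueue assignment (class and method names are mine): hashing the host picks a worker's subqueue, so all URLs for one host are handled by a single thread.

import java.net.URI;
import java.util.*;

// Hash the hostname to choose a worker's subqueue.
public class PoliteFrontierSketch {
    private final List<Deque<String>> subqueues;

    PoliteFrontierSketch(int numWorkerThreads) {
        subqueues = new ArrayList<>();
        for (int i = 0; i < numWorkerThreads; i++) subqueues.add(new ArrayDeque<>());
    }

    void enqueue(String url) throws Exception {
        String host = new URI(url).getHost();
        if (host == null) host = url;                       // fall back for odd URLs
        int worker = Math.floorMod(host.hashCode(), subqueues.size());
        subqueues.get(worker).add(url);                     // only this worker contacts this host
    }

    String dequeueFor(int worker) {
        return subqueues.get(worker).poll();
    }
}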


Mercator’s HTTP Fetcher

First, needs to ensure robots.txt is followed
  Caches the contents of robots.txt for various web sites as it crawls them
Designed to be extensible to other protocols
Had to write their own HTTP requestor in Java – their Java version didn’t have timeouts
  Today, can use setSoTimeout() (see the example below)
Can also use Java non-blocking I/O if you wish:
  http://www.owlmountain.com/tutorials/NonBlockingIo.htm
  But they use multiple threads and synchronous I/O
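For example, a bare-bones socket-level HTTP GET with a connect timeout and setSoTimeout() for reads might look like the following (illustrative only; a real fetcher also needs redirects, HTTP/1.1 features, and better error handling):

import java.io.*;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// HTTP GET over a plain socket, with connect and read timeouts.
public class TimeoutFetchExample {
    public static String fetch(String host, String path) throws IOException {
        try (Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(host, 80), 5000); // 5 s connect timeout
            sock.setSoTimeout(5000);                             // 5 s read timeout
            Writer out = new OutputStreamWriter(sock.getOutputStream(), StandardCharsets.US_ASCII);
            out.write("GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), StandardCharsets.UTF_8));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) response.append(line).append('\n');
            return response.toString();   // headers + body; a stalled read throws SocketTimeoutException
        }
    }
}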


Other Caveats

Infinitely long URL names (good way to get a buffer overflow!)
Aliased host names
Alternative paths to the same host
  Can catch most of these with signatures of document data (e.g., MD5)
Crawler traps (e.g., CGI scripts that link to themselves using a different name)
  May need to have a way for a human to override certain URL paths – see Section 5 of the paper


Mercator Document Statistics

PAGE TYPE     PERCENT
text/html     69.2%
image/gif     17.9%
image/jpeg     8.1%
text/plain     1.5%
pdf            0.9%
audio          0.4%
zip            0.4%
postscript     0.3%
other          1.4%

[Figure: histogram of document sizes (60M pages)]


Further Considerations

May want to prioritize certain pages as being most worth crawling
  Focused crawling tries to prioritize based on relevance
May need to refresh certain pages more often


Web Search Summarized

Two important factors:
  Indexing and ranking scheme that allows the most relevant documents to be prioritized highest
  Crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
We’ll be using Pastry to distribute the work of crawling and to distribute the data (what Google calls “barrels”)


Synchronization

The issue:
  What do we do when decisions need to be made in a distributed system?
  e.g., Who decides what action should be taken? Whose conflicting option wins? Who gets killed in a deadlock? Etc.
Some options:
  Central decision point
  Distributed decision points with locking/scheduling (“pessimistic”)
  Distributed decision points with arbitration (“optimistic”)
  Transactions (to come)


Multi-process Decision Arbitration: Mutual Exclusion (i.e., Locking)

a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.

(A toy coordinator sketch follows.)
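A toy, single-JVM sketch of such a coordinator (illustrative; in a real system the requests, grants, and releases would be messages between processes): the first requester gets the grant, later requesters are queued silently, and a release hands the grant to the next waiter.

import java.util.ArrayDeque;
import java.util.Deque;

// Centralized mutual-exclusion coordinator: grant, queue, release.
public class LockCoordinator {
    private String holder = null;
    private final Deque<String> waiting = new ArrayDeque<>();

    // Returns true if permission is granted immediately; otherwise the request
    // is queued and the coordinator stays silent until release().
    public synchronized boolean requestEntry(String processId) {
        if (holder == null) {
            holder = processId;
            return true;                 // grant
        }
        waiting.add(processId);          // no reply yet
        return false;
    }

    public synchronized String release(String processId) {
        if (!processId.equals(holder)) throw new IllegalStateException("not the holder");
        holder = waiting.poll();         // next waiter, if any
        return holder;                   // caller would send the grant message to this process
    }
}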


Distributed Locking: Still Based on a Central Arbitrator

a) Two processes want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.


Eliminating the Request/Response: A Token Ring Algorithm

a) An unordered group of processes on a network.

b) A logical ring constructed in software.
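As a toy illustration of the token-ring idea (not a complete protocol: no real messages, no lost-token recovery), a single token circulates around the ring and a process enters its critical region only while holding it:

// Simulated token ring: the token holder may enter, then passes the token on.
public class TokenRingSketch {
    public static void main(String[] args) {
        int n = 5;                         // processes 0..n-1 arranged in a logical ring
        int tokenHolder = 0;               // who currently holds the token
        boolean[] wantsCriticalRegion = {false, true, false, true, false};

        for (int step = 0; step < 2 * n; step++) {
            if (wantsCriticalRegion[tokenHolder]) {
                System.out.println("Process " + tokenHolder + " enters the critical region");
                wantsCriticalRegion[tokenHolder] = false;
            }
            tokenHolder = (tokenHolder + 1) % n;   // pass the token to the successor
        }
    }
}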


Comparison of Costs

Algorithm     Messages per entry/exit   Delay before entry (in message times)   Problems
Centralized   3                         2                                       Coordinator crash
Distributed   2 (n – 1)                 2 (n – 1)                               Crash of any process
Token ring    1 to ∞                    0 to n – 1                              Lost token, process crash


Electing a Leader: The Bully Algorithm Is Won By Highest ID

Suppose the old leader dies…
Process 4 needs a decision; it runs an election
Processes 5 and 6 respond, telling 4 to stop
Now 5 and 6 each hold an election


Totally-Ordered Multicasting: Arbitrates via Priority

[Figure: updating a replicated database and leaving it in an inconsistent state.]


Another Approach: Time

We can choose the decision that happened first
… But how do we know what happened first?