
Page 1: Google Cluster Innards

Google Cluster Innards
Martin Dvorak

[email protected]
http://www.e-mental.com/dvorka

Page 2: Google Cluster Innards

Agenda

• Inventing Google
  The Anatomy of a Large-Scale Hypertextual Web Search Engine
  Sergey Brin and Lawrence Page (founders) / 1998

• Cluster Anatomy
  Web Search for a Planet: The Google Cluster Architecture
  Luiz André Barroso, Jeffrey Dean and Urs Hoelzle / 2003
  Google's secret of success? Dealing with failure
  Urs Hoelzle (Vice President of Engineering and Operations) / 2004

• Programming for Google Cluster
  MapReduce: Simplified Data Processing on Large Clusters
  Jeffrey Dean and Sanjay Ghemawat (Google staff) / 2004

Page 3: Google Cluster Innards

Inventing Google

Page 4: Google Cluster Innards

Inventing Google

• Sergey & Larry - Ph.D. students at Stanford University

• Prototype (1998)
  – http://google.stanford.edu
  – 24,000,000 pages indexed (8,058,044,651 today)

• Google
  – "We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines."

• PageRank
  – An objective measure of a page's citation importance that corresponds well with people's subjective idea of importance.

Page 5: Google Cluster Innards

Inventing Google: Foundation

• PageRank*:
  – "We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:"

  PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

(Figure: pages T1...Tn, each with C(T1)...C(Tn) outgoing links, pointing to page A)

*) Named after co-founder Larry Page

Page 6: Google Cluster Innards

Inventing Google: Foundation

• PageRank formula informally
  – PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  – PageRank can be thought of as a model of user behavior.
  – We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page.
  – The probability that the random surfer visits a page is its PageRank.

• A page has a high PR if…
  – there are many pages that point to it,
  – or there are some pages that point to it that themselves have a high PR.
  – Note the recursive weight propagation through the web link structure.
  – Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
  – The damping factor d is the probability that, at each page, the "random surfer" gets bored and requests another random page.
  – (An iterative computation is sketched below.)

• Personalization

Page 7: Google Cluster Innards

Inventing Google: Foundation

• PageRank relevancy tuning
  – Page title
  – Anchor text
  – Meta
  – Font
    • Size
    • Weight
  – Capitalization
  – …

Page 8: Google Cluster Innards

Inventing Google: Anatomy

Page 9: Google Cluster Innards

Inventing Google: Anatomy

• URL Server
  – Provides the list of URLs to be fetched to the crawlers

• Google Crawlers (GoogleBot)
  – Multiple distributed crawlers
    • Own DNS cache
    • 300 connections open at once
  – Send fetched pages to the Store Server
  – Originally written in Python

• Store Server
  – Compresses and stores pages in the Repository
  – A DOCID is created for each page

• Repository
  – Stores fetched pages for further processing by the Indexer

Page 10: Google Cluster Innards

Inventing Google: Anatomy

• Indexer
  – Reads pages from the Repository (uncompresses them)
  – Parses each document (Flex on top of its own stack):
    • Page is converted to a set of Hits (position, font, capitalization, title/anchor/meta), 2 bytes each
    • Added to the Document Index
  – Hits are distributed to Barrels (i.e. one document goes to multiple barrels)
  – Every link found in a page is stored in the Anchors file

• Forward and Inverted Barrels (2 × 64)
  – Forward Index
    • Each barrel keeps a range of Hits sorted by DOCID
    • (DOCID, (WORDID, word's Hit reference+)+)
  – Processed by the Sorter:
    • Generates the inverted index from the forward index – sorts Hits by WORDID
    • Creates (WORDID, offsets) used by the Lexicon
  – Inverted Index (short/full)
    • (WORDID, (DOCID reference, Hit list reference)+)
    • Short: DOCIDs sorted by/contain just quality Hits (word in title, anchor, ...); optimal for single-word search
    • Full: DOCIDs sorted by DOCID; optimal for Hit list merging, i.e. multi-word search
  – (A rough sketch of these structures follows this slide.)

• Anchors file
  – Anchor (from, to, text)

• URL Resolver
  – Reads the Anchors file:
    • Relative to absolute URL conversion + DOCID assignment
    • Creates the Links file

• Links file
  – (URL, target DOCID)

Page 11: Google Cluster Innards

Inventing Google: Anatomy

• Searcher uses…
  – Lexicon
    • Keeps a map saying which Barrel to use
    • Originally kept in memory (256MB)
    • IMHO something like a multi-level VM page table must be used now
    • It is/was of fixed size (14,000,000 words)
  – Barrels
    • Each barrel keeps a range of WORDIDs
    • WORDID → DOCID map
  – PageRank pool
    • Keeps the computed PageRank for each DOCID
  – Doc Index
    • DOCID-ordered information about each document
    • (DOCID, status, repository pointer, checksum, stat, URL, title)

Page 12: Google Cluster Innards

Cluster Innards

Page 13: Google Cluster Innards

Cluster Innards: Global Google

• Over 30 Google clusters around the world.
  – DNS-based & geo-location-driven load-balancing:

    Domain Name: GOOGLE.COM
    Registrar: ALLDOMAINS.COM INC.
    Whois Server: whois.alldomains.com
    Referral URL: http://www.alldomains.com
    Name Server: NS2.GOOGLE.COM
    Name Server: NS1.GOOGLE.COM
    Name Server: NS3.GOOGLE.COM
    Name Server: NS4.GOOGLE.COM
    Status: REGISTRAR-LOCK
    Updated Date: 03-oct-2002
    Creation Date: 15-sep-1997
    Expiration Date: 14-sep-2011

• 2005, May 7: Google DNS hack speculations

• Total PCs
  – > 5,000 in 2000
  – > 15,000 in 2003
  – > 79,000* in 2004

*) I'm not sure about this number; it was taken from an external resource.

Page 14: Google Cluster Innards

Cluster Innards: HW

• Basic cluster design insights
  – Reliability in SW rather than server-class HW.
    • Commodity PCs are used to build a high-end computing cluster at low-end prices.
    • Example:
      – $287,000 – 176x 2GHz Xeon, 176GB RAM, 7TB HDD
      – $758,000 – 8x 2GHz Xeon, 64GB RAM, 8TB HDD
  – Design is tailored for best aggregate request throughput, not peak server response time – individual requests are parallelized.

• Google has inexpensively built out its computing infrastructure by using thousands of "commodity" servers – <2,000 servers in a single cluster.
  – Dual-processor x86 servers (starting at 533MHz Celeron) with 2-4 GB of memory per machine, one or more 80GB IDE drives.
  – Rack: 40-80 x86-based servers.

Page 15: Google Cluster Innards

Cluster Innards: HW

• Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem.
  – "At Google scale... if you have thousands of PCs, you can expect one (failure) a day,…"

• Millions, not billions of dollars.
  – "The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation."

• Switched Ethernet
  – Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
  – Locality optimizations (GFS)

Page 16: Google Cluster Innards

Cluster Innards: SW

• Stripped-down version of Linux, which is based on the Red Hat distribution but is really just the operating system kernel modified for Google.

• Google File System (GFS) is optimized for handling large blocks of data.
  – 64MB blocks
  – The file system was designed to assume that a failure, such as a failed disk or an unplugged network cable, can happen at any time.
  – Data is replicated in three places, and there is a "master" machine that can locate copies of a piece of data, such as a keyword index, if the original is out of commission.

• Google has created "batch" job scheduling software, the Global Work Queue, that acts as a sort of taskmaster for millions of operations.

• Another important engineering feat done by Google is to make writing programs that run across thousands of servers very straightforward…

Page 17: Google Cluster Innards

Programming for Cluster

Page 18: Google Cluster Innards

Programming For Cluster

• Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets.
  – Automates the task of recovering a program in case of a failure.
  – It is critical to keeping the company's costs down.

• MR in brief:
  – Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

• Features
  – Functional-style programming.
  – Automatic parallelization.

Page 19: Google Cluster Innards

Programming For Cluster

• The MapReduce runtime…
  – takes care of the details of partitioning the input data,
  – schedules the program's execution across a set of machines,
  – handles machine failures,
  – manages the required inter-machine communication,
  – …and more.

• MR hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
  – Therefore even programmers without any experience with parallel and distributed systems can easily utilize the resources of a large distributed system.

• Numbers…
  – TBs of data processed (>20)
  – On 1,000s of machines
  – 100s of MR programs in place

Page 20: Google Cluster Innards

Programming For Cluster

• Special-purpose computation examples:
  – Inputs:
    • crawled documents
    • web request logs
    • …
  – Outputs:
    • inverted indices
    • various representations of the graph structure of web documents
    • summaries of the number of pages crawled per host (dumper)
    • the set of most frequent queries in a given day, etc.

Page 21: Google Cluster Innards

Programming For Cluster

• LISP roots
  – Recall the map and reduce primitives (Python 2 syntax):
    • map(func, list, ...)
      – Creates a new list from the results of applying func to each element of each list. There must be one list per argument to the function.
      – map(lambda x, y: x+y, [1,2], [3,4]) --> [4, 6]
      – map(None, [1,2], [3,4]) --> [(1, 3), (2, 4)]
    • reduce(func, list {, init})
      – Applies func to each pair of items in turn; the results are accumulated.
      – reduce(lambda x, y: x+y, [1,2,3,4], 5) --> 15
      – reduce(lambda x, y: x&y, [1,0,1]) --> 0
      – reduce(None, [], 1) --> 1

• Programming model
  – Key/values → key/values
  – Map & reduce functions written by the user are linked with the MR library:
    • map (k1, v1) → list(k2, v2)
    • reduce (k2, list(v2)) → list(v2)
  – Input/output files, tuning parameters, … (a toy simulation of the model is sketched below)
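Not from the slides: a toy, single-machine simulation of the (k1, v1) → list(k2, v2) / (k2, list(v2)) → list(v2) model above. The map_reduce name and its signature are invented for this sketch; the real MR library distributes these phases across thousands of workers.

    # Toy single-process stand-in for the MapReduce programming model.
    from collections import defaultdict

    def map_reduce(map_fn, reduce_fn, records):
        # Map phase: each (k1, v1) record yields (k2, v2) pairs
        intermediate = defaultdict(list)
        for k1, v1 in records:
            for k2, v2 in map_fn(k1, v1):
                intermediate[k2].append(v2)           # group (shuffle) by k2
        # Reduce phase: (k2, list(v2)) -> reduced value
        return {k2: reduce_fn(k2, values)
                for k2, values in intermediate.items()}

    # Tiny demo: count characters per key
    print(map_reduce(lambda k, v: [(c, 1) for c in v],
                     lambda k, vs: sum(vs),
                     [("doc", "abca")]))              # {'a': 2, 'b': 1, 'c': 1}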

Page 22: Google Cluster Innards

Programming For Cluster

• Example (pseudocode):

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
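Not from the slides: the same word count expressed in plain Python as a toy single-process stand-in that mirrors the pseudocode above; the document names and contents are made up.

    # Word count mirroring the pseudocode above; toy data, single process.
    from collections import defaultdict

    def wc_map(doc_name, contents):
        for word in contents.split():
            yield word, 1                             # EmitIntermediate(w, 1)

    def wc_reduce(word, counts):
        return sum(counts)                            # Emit(total)

    docs = [("doc1", "to be or not to be"), ("doc2", "to see or not")]
    grouped = defaultdict(list)
    for name, text in docs:
        for word, count in wc_map(name, text):
            grouped[word].append(count)
    print({word: wc_reduce(word, counts) for word, counts in grouped.items()})
    # {'to': 3, 'be': 2, 'or': 2, 'not': 2, 'see': 1}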

Page 23: Google Cluster Innards

Programming For Cluster

• The example solves the problem of counting the number of occurrences of each word in a large collection of documents.

• More examples:
  – Distributed Grep
    • Map: (URL, true); Reduce: identity
  – Count of URL Access Frequency
    • The page access log is used as input for map.
    • Map: (URL, 1); Reduce: (URL, total count)
  – Reverse Web-Link Graph (sketched below)
    • The link database is processed by map.
    • Map: (target, source); Reduce: (target, list(source))
  – Term-Vector per Host
    • A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs.
    • The hostname is determined for each document by map and a term vector is created (so there are multiple entries); the reduce function then merges all entries associated with a particular host and throws away infrequent terms.
    • Map: (hostname, term vector); Reduce: (hostname, term vector)
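Not from the slides: a sketch of the Reverse Web-Link Graph example from the list above, with a made-up toy link set and a hand-rolled grouping step standing in for the MR shuffle.

    # Reverse web-link graph: map emits (target, source) for every outgoing
    # link on a source page; the grouped result is (target, list(source)).
    from collections import defaultdict

    def link_map(source_url, outgoing_links):
        for target in outgoing_links:
            yield target, source_url

    pages = [("a.com", ["b.com", "c.com"]), ("b.com", ["c.com"])]
    reverse_graph = defaultdict(list)                 # reduce: collect sources
    for source, links in pages:
        for target, src in link_map(source, links):
            reverse_graph[target].append(src)
    print(dict(reverse_graph))
    # {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}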

Page 24: Google Cluster Innards

Programming For Cluster

Page 25: Google Cluster Innards

Programming For Cluster

• Also…
  – Multiple tasks performed by a single worker (load-balancing)
  – Master
    • tracks task state: idle, in-progress, completed (per worker)
    • worker failure (re-execution)
    • master failure (rare, restart)
  – MR locality optimization
    • Saves network bandwidth
    • GFS 64MB block replication
    • The MR scheduler takes GFS replica locations into account
  – Stragglers
    • Overload, HW problems, etc.
    • Backup task – fork a twin execution of the straggler
  – Task granularity
    • The number of map (M) and reduce (R) tasks is much larger than the number of workers
    • For example: M=200,000; R=5,000 using 2,000 workers

Page 26: Google Cluster Innards

Putting Things Together

Page 27: Google Cluster Innards

I'm Feeling Lucky

1. Pre-phase:
  – Browser requests e.g. http://www.google.com/search?q=edu+session
  – DNS-based load-balancing selects a cluster according to the geographical location of the user & actual cluster utilization
  – The rest of the evaluation is entirely local to that cluster

2. Phase 1: Index servers...
  – Parse the query
    • Perform spell-check and fork the Ad task
    • Convert words into WORDIDs
  – Choose inverted Barrel(s) using the Lexicon
    • A barrel index is formed by a number of servers whose data are randomly distributed and replicated (full index/index shards), so the search is highly parallelizable
  – The inverted barrel maps each query word to a matching list of documents (Hit list):
    1. Seek to the start of the document list in the short barrel for every word (multiple tasks)
    2. Scan through the document lists until there is a document that matches all search terms
    3. If we are in the short barrels and at the end of any document list, seek to the start of the document list in the full barrel for every word and go to step 2
    4. If we are not at the end of any document list, go to step 2
  – Sort the DOCIDs that have matched
  – (A rough sketch of this matching loop follows this slide.)

3. Phase 2: Document servers...
  – For each DOCID compute the actual title, URL and a query-specific document summary (context of the matched words).
  – Document servers are used to perform this completion – documents are also randomly distributed and replicated, so the completion is highly parallelizable
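Not from the slides: a rough sketch of the Phase 1 matching loop over the short and full barrels. It simplifies the algorithm (document lists are plain DOCID-sorted lists, both barrels are always scanned, and no ranking is done) and all names and data are made up.

    # Simplified multi-word matching over short and full barrels.
    # Each barrel maps WORDID -> DOCID-sorted document list; real hit lists
    # also carry hit data used for ranking.
    def match_all_terms(wordids, short_barrel, full_barrel):
        matches = []
        for barrel in (short_barrel, full_barrel):
            doc_lists = [barrel.get(w, []) for w in wordids]
            if not all(doc_lists):
                continue                              # some word missing here
            positions = [0] * len(doc_lists)
            while all(p < len(dl) for p, dl in zip(positions, doc_lists)):
                current = [dl[p] for p, dl in zip(positions, doc_lists)]
                if len(set(current)) == 1:            # doc matches every term
                    matches.append(current[0])
                    positions = [p + 1 for p in positions]
                else:                                 # advance the smallest DOCID
                    positions[current.index(min(current))] += 1
        return sorted(set(matches))

    short = {1: [10, 42], 2: [42]}                    # WORDID -> short doc list
    full = {1: [10, 42, 77], 2: [42, 77]}             # WORDID -> full doc list
    print(match_all_terms([1, 2], short, full))       # [42, 77]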

Page 28: Google Cluster Innards

Bonus

Page 29: Google Cluster Innards

Stanford lab (around 1996)

Page 30: Google Cluster Innards

The Original Google Storage: 10x4GB (1996)

Page 31: Google Cluster Innards

Google San Francisco (2004)

Page 32: Google Cluster Innards

A cluster of coolness
Google History

Page 33: Google Cluster Innards

Google Results Page Per Day

Page 34: Google Cluster Innards

References

• Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine; 1998
• Luiz André Barroso, Jeffrey Dean, Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture; 2003
• Urs Hoelzle: Google's secret of success? Dealing with failure; 2004
• Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters; 2004
• Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System; 2003
• http://www.google.com
  – GoogleBot: http://www.google.com/bot.html
• http://www.googleguide.com
• http://www.searchengineposition.com
• http://www.google-watch.org
• Uniquely Google™: http://www-db.stanford.edu/pub/voy/museum/google.htm
• Stanford Gadgets: http://www-db.stanford.edu/pub/voy/museum/pictures/
• Google hacked? http://www.gigaom.com/2005/05/07/google-hacked/