
CS 61C: Great Ideas in Computer Architecture (Machine Structures)

Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11

Review

• CS 61C: Learn 6 great ideas in computer architecture to enable high-performance programming via parallelism, not just learn C
   1. Layers of Representation/Interpretation
   2. Moore's Law
   3. Principle of Locality/Memory Hierarchy
   4. Parallelism
   5. Performance Measurement and Improvement
   6. Dependability via Redundancy
• Post-PC Era: parallel processing, from smart phones to WSCs
• WSC software must cope with failures, varying load, and varying HW latency and bandwidth
• WSC hardware is sensitive to cost and energy efficiency

New-School Machine Structures (It's a bit more complicated!)

• Parallel Requests: assigned to a computer, e.g., search "Katz"

• Parallel Threads: assigned to a core, e.g., lookup, ads

• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions

• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words

• Hardware descriptions: all gates @ one time

[Figure: the levels at which parallelism is harnessed to achieve high performance, spanning software and hardware: warehouse scale computer and smart phone; computer (cores, memory/cache, main memory, input/output); core (instruction unit(s) and functional unit(s) operating on data such as A0+B0 ... A3+B3); and logic gates. A "Today's Lecture" callout appears in the figure.]

Agenda

• Request- and Data-Level Parallelism
• Administrivia + 61C in the News + Internship Workshop + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)

Request-Level Parallelism (RLP)

• Hundreds or thousands of requests per second
   – Not your laptop or cell phone, but popular Internet services like Google search
   – Such requests are largely independent
      • Mostly involve read-only databases
      • Little read-write (aka "producer-consumer") sharing
      • Rarely involve read-write data sharing or synchronization across requests
• Computation easily partitioned within a request and across different requests (see the sketch below)
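To make the independence concrete, here is a minimal Python sketch (mine, not from the slides) that serves many read-only requests in parallel with a thread pool; the query_index function and the index contents are made-up illustrations.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical read-only "database": no locks needed because requests never write.
    INDEX = {"katz": ["doc1", "doc7"], "patterson": ["doc2", "doc7"]}

    def query_index(term):
        # Each request only reads shared data, so requests are independent.
        return term, INDEX.get(term.lower(), [])

    requests = ["Katz", "Patterson", "Katz", "MapReduce"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for term, docs in pool.map(query_index, requests):
            print(term, "->", docs)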

Google Query-Serving Architecture


Anatomy of a Web Search

• Google "Randy H. Katz"
   – Direct the request to the "closest" Google Warehouse Scale Computer
   – Front-end load balancer directs the request to one of many arrays (clusters of servers) within the WSC
   – Within the array, select one of many Google Web Servers (GWS) to handle the request and compose the response pages
   – GWS communicates with Index Servers to find documents that contain the search words "Randy", "Katz"; uses the location of the search as well
   – Return a document list with associated relevance scores

Anatomy of a Web Search

• In parallel:
   – Ad system: books by Katz at Amazon.com
   – Images of Randy Katz
• Use docids (document IDs) to access indexed documents
• Compose the page
   – Result document extracts (with keyword in context) ordered by relevance score
   – Sponsored links (along the top) and advertisements (along the sides)


Anatomy of a Web Search

• Implementation strategy
   – Randomly distribute the entries
   – Make many copies of the data (aka "replicas")
   – Load balance requests across replicas (see the sketch below)
• Redundant copies of indices and documents
   – Breaks up hot spots, e.g., "Justin Bieber"
   – Increases opportunities for request-level parallelism
   – Makes the system more tolerant of failures
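A minimal sketch (my own, not from the slides) of load-balancing requests across redundant index replicas; the replica names and the random-choice policy are illustrative assumptions.

    import random

    # Hypothetical replicas of the same index shard.
    REPLICAS = ["index-a", "index-b", "index-c"]

    def pick_replica(replicas):
        # Random choice spreads load and breaks up hot spots;
        # a failed replica can simply be dropped from the list.
        return random.choice(replicas)

    for query in ["justin bieber", "randy katz", "justin bieber"]:
        print(query, "->", pick_replica(REPLICAS))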

Data-Level Parallelism (DLP)

• Two kinds
   – Lots of data in memory that can be operated on in parallel (e.g., adding together two arrays)
   – Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• The March 1 lecture and the 3rd project cover Data-Level Parallelism (DLP) in memory (a minimal sketch follows below)
• Today's lecture and the 1st project cover DLP across 1000s of servers and disks using MapReduce
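A tiny sketch of the in-memory kind, using the "adding together two arrays" example: the element-wise additions are independent, so they can be split across worker threads. The chunk size and thread count here are arbitrary illustrations.

    from concurrent.futures import ThreadPoolExecutor

    A = list(range(1000))
    B = list(range(1000, 2000))

    def add_chunk(bounds):
        lo, hi = bounds
        # Each chunk of element-wise additions is independent of the others.
        return [A[i] + B[i] for i in range(lo, hi)]

    chunks = [(i, i + 250) for i in range(0, len(A), 250)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        C = [x for part in pool.map(add_chunk, chunks) for x in part]

    assert C[0] == 1000 and C[-1] == 999 + 1999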

Problem Trying To Solve

• How do we process large amounts of raw data (crawled documents, request logs, ...) every day to compute derived data (inverted indices, page popularity, ...), when the computation is conceptually simple but the input data is large and distributed across 100s to 1000s of servers, so that we finish in reasonable time?
• Challenge: parallelize the computation, distribute the data, and tolerate faults without obscuring the simple computation with complex code to deal with these issues
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, Jan 2008.


MapReduce Solution

• Apply a Map function to each user-supplied record of key/value pairs
• Compute a set of intermediate key/value pairs
• Apply a Reduce operation to all values that share the same key in order to combine the derived data properly
• The user supplies the Map and Reduce operations in a functional model, so the runtime can parallelize them, using re-execution for fault tolerance

Data-Parallel "Divide and Conquer" (MapReduce Processing)

• Map:
   – Slice data into "shards" or "splits", distribute these to workers, compute sub-problem solutions
   – map(in_key, in_value) -> list(out_key, intermediate_value)
      • Processes an input key/value pair
      • Produces a set of intermediate pairs
• Reduce:
   – Collect and combine sub-problem solutions
   – reduce(out_key, list(intermediate_value)) -> list(out_value)
      • Combines all intermediate values for a particular key
      • Produces a set of merged output values (usually just one)
• Fun to use: focus on the problem, let the MapReduce library deal with the messy details (a minimal sketch of this model follows below)
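Here is a minimal, single-machine Python sketch of the functional model just described (an illustration, not Google's implementation): the "library" groups intermediate values by key between the user's map and reduce functions.

    from collections import defaultdict

    def map_reduce(inputs, map_fn, reduce_fn):
        # "Map phase": the user's map_fn emits (key, value) pairs for each input record.
        intermediate = defaultdict(list)
        for in_key, in_value in inputs:
            for out_key, value in map_fn(in_key, in_value):
                intermediate[out_key].append(value)
        # "Reduce phase": the user's reduce_fn combines all values that share a key.
        return {key: reduce_fn(key, values) for key, values in intermediate.items()}

    # Example: word count expressed in this model.
    docs = [("d1", "that that is"), ("d2", "is that")]
    counts = map_reduce(docs,
                        map_fn=lambda k, text: [(w, 1) for w in text.split()],
                        reduce_fn=lambda k, vals: sum(vals))
    print(counts)   # {'that': 3, 'is': 2}

Because both user functions are side-effect free, the library is free to run many map and reduce calls in parallel and to simply re-run any call that fails.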

MapReduce Execution

Fine-granularity tasks: many more map tasks than machines
2,000 servers => ≈ 200,000 Map Tasks, ≈ 5,000 Reduce tasks

MapReduce Popularity at Google

                                       Aug-04    Mar-06     Sep-07     Sep-09
  Number of MapReduce jobs             29,000   171,000  2,217,000  3,467,000
  Average completion time (secs)          634       874        395        475
  Server years used                       217     2,002     11,081     25,562
  Input data read (TB)                  3,288    52,254    403,152    544,130
  Intermediate data (TB)                  758     6,743     34,774     90,120
  Output data written (TB)                193     2,970     14,018     57,520
  Average number of servers per job       157       268        394        488

Google Uses MapReduce For ...

• Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document
• Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
• Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
• Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
• Fault-tolerant parallel execution of programs written in higher-level languages across a collection of input data
• More than 10,000 MR programs at Google in 4 years

What if the Google Workload Ran on EC2?

                                       Aug-04    Mar-06     Sep-07     Sep-09
  Number of MapReduce jobs             29,000   171,000  2,217,000  3,467,000
  Average completion time (secs)          634       874        395        475
  Server years used                       217     2,002     11,081     25,562
  Input data read (TB)                  3,288    52,254    403,152    544,130
  Intermediate data (TB)                  758     6,743     34,774     90,120
  Output data written (TB)                193     2,970     14,018     57,520
  Average number of servers per job       157       268        394        488
  Average cost/job on EC2                 $17       $39        $26        $38
  Annual cost if on EC2                 $0.5M     $6.7M     $57.4M    $133.1M


Agenda

• Request- and Data-Level Parallelism
• Administrivia + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)

Course Information

• Course Web: http://inst.eecs.Berkeley.edu/~cs61c/sp11
• Instructors:
   – Randy Katz, Dave Patterson
• Teaching Assistants:
   – Andrew Gearhart, Conor Hughes, Yunsup Lee, Ari Rabkin, Charles Reiss, Andrew Waterman, Vasily Volkov
• Textbooks: average 15 pages of reading/week
   – Patterson & Hennessy, Computer Organization and Design, 4th Edition (not ≤3rd Edition, not the Asian version of the 4th edition)
   – Kernighan & Ritchie, The C Programming Language, 2nd Edition
   – Barroso & Holzle, The Datacenter as a Computer, 1st Edition
• Google Groups:
   – 61CSpring2011UCB-announce: announcements from staff
   – 61CSpring2011UCB-disc: Q&A, discussion by anyone in 61C
   – Email Andrew Gearhart [email protected] to join

This Week

• Discussions and labs will be held this week
   – Switching sections: if you find another 61C student willing to swap discussion AND lab, talk to your TAs
   – Partners (only project 3 and extra credit): OK if partners mix sections but have the same TA
• First homework assignment due this Sunday, January 23rd, by 11:59:59 PM
   – There is a reading assignment as well on the course page

Course Organization

• Grading
   – Participation and Altruism (5%)
   – Homework (5%)
   – Labs (20%)
   – Projects (40%)
      1. Data Parallelism (MapReduce on Amazon EC2)
      2. Computer Instruction Set Simulator (C)
      3. Performance Tuning of a Parallel Application/Matrix Multiply using cache blocking, SIMD, MIMD (OpenMP; done with a partner)
      4. Computer Processor Design (Logisim)
   – Extra Credit: Matrix Multiply Competition, anything goes
   – Midterm (10%): 6-9 PM, Tuesday, March 8
   – Final (20%): 11:30-2:30 PM, Monday, May 9

EECS Grading Policy

• http://www.eecs.berkeley.edu/Policies/ugrad.grading.shtml
   "A typical GPA for courses in the lower division is 2.7. This GPA would result, for example, from 17% A's, 50% B's, 20% C's, 10% D's, and 3% F's. A class whose GPA falls outside the range 2.5-2.9 should be considered atypical."
• Fall 2010: GPA 2.81; 26% A's, 47% B's, 17% C's, 3% D's, 6% F's
• Job/Intern interviews: they grill you with technical questions, so it's what you say, not your GPA
   (The new 61C gives you good stuff to say)

          Fall   Spring
  2010    2.81   2.81
  2009    2.71   2.81
  2008    2.95   2.74
  2007    2.67   2.76

Late Policy

• Assignments due Sundays at 11:59:59 PM
• Late homeworks not accepted (100% penalty)
• Late projects get a 20% penalty, accepted up to Tuesdays at 11:59:59 PM
   – No credit if more than 48 hours late
   – No "slip days" in 61C
      • Used by Dan Garcia and a few faculty to cope with 100s of students who often procrastinate without having to hear the excuses, but not widespread in EECS courses
      • More late assignments if everyone has no-cost options; better to learn now how to cope with real deadlines


Policy on Assignments and Independent Work

• With the exception of laboratories and assignments that explicitly permit you to work in groups, all homeworks and projects are to be YOUR work and your work ALONE.
• You are encouraged to discuss your assignments with other students, and extra credit will be assigned to students who help others, particularly by answering questions on the Google Group, but we expect that what you hand in is yours.
• It is NOT acceptable to copy solutions from other students.
• It is NOT acceptable to copy (or start your solutions from) the Web.
• We have tools and methods, developed over many years, for detecting this. You WILL be caught, and the penalties WILL be severe.
• At a minimum, a ZERO for the assignment, possibly an F in the course, and a letter to your university record documenting the incidence of cheating.
• (We caught people last semester!)

YOUR BRAIN ON COMPUTERS; Hooked on Gadgets, and Paying a Mental Price

NY Times, June 7, 2010, by Matt Richtel

SAN FRANCISCO -- When one of the most important e-mail messages of his life landed in his in-box a few years ago, Kord Campbell overlooked it.

Not just for a day or two, but 12 days. He finally saw it while sifting through old messages: a big company wanted to buy his Internet start-up. "I stood up from my desk and said, 'Oh my God, oh my God, oh my God,'" Mr. Campbell said. "It's kind of hard to miss an e-mail like that, but I did."

The message had slipped by him amid an electronic flood: two computer screens alive with e-mail, instant messages, online chats, a Web browser and the computer code he was writing.

While he managed to salvage the $1.3 million deal after apologizing to his suitor, Mr. Campbell continues to struggle with the effects of the deluge of data. Even after he unplugs, he craves the stimulation he gets from his electronic gadgets. He forgets things like dinner plans, and he has trouble focusing on his family. His wife, Brenda, complains, "It seems like he can no longer be fully in the moment."

This is your brain on computers.

Scientists say juggling e-mail, phone calls and other incoming information can change how people think and behave. They say our ability to focus is being undermined by bursts of information. These play to a primitive impulse to respond to immediate opportunities and threats. The stimulation provokes excitement -- a dopamine squirt -- that researchers say can be addictive. In its absence, people feel bored.

The resulting distractions can have deadly consequences, as when cellphone-wielding drivers and train engineers cause wrecks. And for millions of people like Mr. Campbell, these urges can inflict nicks and cuts on creativity and deep thought, interrupting work and family life.

While many people say multitasking makes them more productive, research shows otherwise. Heavy multitaskers actually have more trouble focusing and shutting out irrelevant information, scientists say, and they experience more stress.

And scientists are discovering that even after the multitasking ends, fractured thinking and lack of focus persist. In other words, this is also your brain off computers.

The Rules (and we really mean it!)

Architecture of a Lecture

[Figure: attention (y-axis) vs. time in minutes (x-axis, ticks at 0, 20, 25, 50, 53, 78, 80), annotated "Full", "Administrivia" (20-25 min), "Tech break" (50-53 min), and "And in conclusion..." (78-80 min).]

61C in the News

• IEEE Spectrum Top 11 Innovations of the Decade

[Figure: the magazine's Top 11 list, with "61C" callouts marking the innovations touched on in this course.]

Raffle for prizes! Special thanks to the companies donating prizes:
Linear Technologies T-shirts; Microsoft Windows 7 & Office 2010; Yelp T-shirts & beanie; WhitePages 1-yr subscription plus a mobile look-up app for iPhone or Android; and other prizes donated by Autodesk and Facebook.

Participating companies: Adobe, Aerospace Corp., Agilent Technologies, Amazon.com, Apple, Inc., Arista Networks, Autodesk, Inc., CBS Interactive, Cisco Systems, Citadel, eBay, EMC2, Ericsson, Facebook, Google, Hewlett Packard, Intel, KLA Tencor, Lab 126, Linear Technologies, Lockheed Martin, Microsoft, National Instruments, NetApp, Nvidia, Oracle, Palantir Technologies, Pixar, Redfin, Riverbed, salesforce.com, Shoretel, Splunk, Starkey, Sybase, Tibco, Twitter, VMware, WhitePages, Yahoo!, Yelp, Zynga.

EECS Internship Openhouse 2011: Summer Internship Opportunities
Industrial Relations Office
Monday, January 24, 11:00 a.m. to 2:00 p.m.
Pauley Ballroom, UC Berkeley (MLK Jr. Student Union, Bancroft Ave. at Telegraph)


The Secret to Getting Good Grades

• A grad student said he finally figured it out
   – (Mike Dahlin, now a Professor at UT Austin)
• My question: What is the secret?
• Do the assigned reading the night before, so that you get more value from the lecture
• Fall 61C comment on the end-of-semester survey: "I wish I had followed Professor Patterson's advice and did the reading before each lecture."

MapReduce Processing Example: Count Word Occurrences

• Pseudo code, for the simple case of assuming just 1 word per document (a runnable Python version follows the pseudocode):

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1"); // Produce count of words

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v); // get integer from key-value
      Emit(AsString(result));

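The same word-count logic written as runnable Python (my illustration, not the production API): the returned lists and the in-memory grouping stand in for EmitIntermediate/Emit and for the shuffle.

    from collections import defaultdict

    def map_fn(input_key, input_value):
        # input_key: document name; input_value: document contents
        return [(w, "1") for w in input_value.split()]

    def reduce_fn(output_key, intermediate_values):
        # output_key: a word; intermediate_values: a list of counts as strings
        return str(sum(int(v) for v in intermediate_values))

    docs = {"doc1": "that that is is that", "doc2": "that is not"}

    intermediate = defaultdict(list)
    for name, contents in docs.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)      # stands in for EmitIntermediate

    result = {w: reduce_fn(w, vals) for w, vals in intermediate.items()}
    print(result)   # {'that': '4', 'is': '3', 'not': '1'}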

MapReduce Processing

[Figure: overview of MapReduce execution: input splits feed map tasks, a shuffle phase routes intermediate key/value pairs to reduce tasks, and the reduce tasks write the output files. The following steps walk through this figure.]

1. MapReduce first splits the input files into M "splits", then starts many copies of the program on the servers.

2. One copy, the master, is special. The rest are workers. The master picks idle workers and assigns each one of the M map tasks or one of the R reduce tasks.

3. A map worker reads its input split, parses key/value pairs out of the input data, and passes each pair to the user-defined map function. (The intermediate key/value pairs produced by the map function are buffered in memory.)


4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.

5. When a reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. (The sorting is needed because typically many different keys map to the same reduce task.)

6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. The MapReduce call in the user program returns back to user code. The output of MR is in R output files (one per reduce task, with file names specified by the user); it is often passed into another MR job. (A compact sketch of the partition, sort, and group steps follows below.)
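A compact Python sketch of steps 4-6 (partitioning into R regions, then sorting and grouping by key on the reduce side); the hash partitioner, R = 2, and the word-count reduce function are illustrative choices, not the real implementation.

    from itertools import groupby

    R = 2  # number of reduce tasks (illustrative)
    intermediate = [("that", 1), ("is", 1), ("that", 1), ("not", 1), ("is", 1)]

    # Step 4: partition buffered pairs into R regions with a partitioning function.
    regions = [[] for _ in range(R)]
    for key, value in intermediate:
        regions[hash(key) % R].append((key, value))

    # Steps 5-6: each reduce task sorts its region by key, groups equal keys,
    # and hands each key plus its values to the reduce function (here: sum).
    for r, region in enumerate(regions):
        region.sort(key=lambda kv: kv[0])
        for key, group in groupby(region, key=lambda kv: kv[0]):
            print("reduce task", r, ":", key, "->", sum(v for _, v in group))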

MapReduce Processing Time Line

• Master assigns map and reduce tasks to "worker" servers
• As soon as a map task finishes, the worker server can be assigned a new map or reduce task
• Data shuffle begins as soon as a given Map finishes
• Reduce task begins as soon as all data shuffles finish
• To tolerate faults, reassign a task if a worker server "dies"

Show MapReduce Job Running

• ~41 minutes total
   – ~29 minutes for Map tasks & Shuffle tasks
   – ~12 minutes for Reduce tasks
   – 1,707 worker servers used
• Map (green) tasks read 0.8 TB, write 0.5 TB
• Shuffle (red) tasks read 0.5 TB, write 0.5 TB
• Reduce (blue) tasks read 0.5 TB, write 0.5 TB


MapReduce Processing: Execution

[Figure: the stages Map, Shuffle, Reduce/Combine, and Collect Results. Map key k1 hashes to Reduce Task 2, k2 to Reduce Task 1, k3 to Reduce Task 2, k4 to Reduce Task 1, and k5 to Reduce Task 1.]


Another Example: Word Index (How Often Does a Word Appear?)

Input, distributed across four map tasks:
  "that that is" | "is that that" | "is not is not" | "is that it it is"

Map outputs:
  Map 1: is 1, that 2    Map 2: is 1, that 2    Map 3: is 2, not 2    Map 4: is 2, it 2, that 1

Shuffle (group intermediate pairs by key):
  Reduce 1 receives: is 1,1 and is 2,2 (-> is 1,1,2,2), plus it 2
  Reduce 2 receives: that 2,2 and that 1 (-> that 2,2,1), plus not 2

Reduce outputs:
  Reduce 1: is 6; it 2    Reduce 2: not 2; that 5

Collect: is 6; it 2; not 2; that 5

(A quick Python check of this trace follows below.)
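A quick Python check of the trace above (an illustration; the four-way split mirrors the figure, and Counter does the per-split counting and the final combine).

    from collections import Counter

    splits = ["that that is", "is that that", "is not is not", "is that it it is"]

    # Map: count words within each split.
    map_outputs = [Counter(s.split()) for s in splits]

    # Shuffle + Reduce: combine the per-split counts by key.
    total = Counter()
    for m in map_outputs:
        total.update(m)

    print(dict(total))   # {'that': 5, 'is': 6, 'not': 2, 'it': 2}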

MapReduce Failure Handling

• On worker failure:
   – Detect failure via periodic heartbeats
   – Re-execute completed and in-progress map tasks
   – Re-execute in-progress reduce tasks
   – Task completion committed through the master
• Master failure:
   – Could handle, but don't yet (master failure unlikely)
• Robust: lost 1,600 of 1,800 machines once, but finished fine

MapReduce Redundant Execution

• Slow workers significantly lengthen completion time
   – Other jobs consuming resources on the machine
   – Bad disks with soft errors transfer data very slowly
   – Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup copies of tasks
   – Whichever one finishes first "wins" (see the sketch below)
• Effect: dramatically shortens job completion time
   – 3% more resources, large tasks 30% faster
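A toy sketch of the backup-task idea (mine, not the real scheduler): launch a duplicate of a straggling task and take whichever copy finishes first. The simulated delays are made up.

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_task(copy_name):
        # Simulate a task whose speed varies (e.g., a slow disk on one machine).
        time.sleep(random.uniform(0.05, 0.5))
        return copy_name

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, "primary"), pool.submit(run_task, "backup")]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        print("winner:", next(iter(done)).result())
        # The loser's result is simply ignored, as in the real system.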

Impact on Execution of Restart and Failure for a 10B-record Sort using 1,800 servers

MapReduce Locality Optimization

• Master scheduling policy:
   – Asks GFS (Google File System) for the locations of the replicas of the input file blocks
   – Map tasks typically split into 64 MB chunks (== GFS block size)
   – Map tasks scheduled so a GFS input block replica is on the same machine or the same rack (see the sketch below)
• Effect: thousands of machines read input at local disk speed
• Without this, rack switches limit the read rate
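A minimal sketch of that scheduling preference (my own illustration, with made-up cluster metadata): prefer a worker that already holds a replica of the block, then a worker in the same rack, then anything.

    # Hypothetical cluster metadata: which rack each worker is in,
    # and which workers hold a replica of the input block.
    rack_of = {"w1": "rackA", "w2": "rackA", "w3": "rackB"}
    replica_hosts = {"w3"}          # GFS says the block lives on w3
    idle_workers = ["w1", "w2", "w3"]

    def pick_worker(idle, replicas, racks):
        # 1) a worker that already has the data on local disk
        for w in idle:
            if w in replicas:
                return w, "local disk"
        # 2) a worker in the same rack as some replica
        replica_racks = {racks[r] for r in replicas}
        for w in idle:
            if racks[w] in replica_racks:
                return w, "same rack"
        # 3) otherwise anyone (data then crosses rack switches)
        return idle[0], "remote"

    print(pick_worker(idle_workers, replica_hosts, rack_of))   # ('w3', 'local disk')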

Agenda

• Request- and Data-Level Parallelism
• Administrivia + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)


Design Goals of a WSC

• Unique to warehouse scale
   – Ample parallelism:
      • Batch apps: a large number of independent data sets with independent processing. Also known as Data-Level Parallelism
   – Scale and its opportunities/problems:
      • The relatively small number of these makes design cost expensive and difficult to amortize
      • But price breaks are possible from purchases of very large numbers of commodity servers
      • Must also prepare for high component failures
   – Operational costs count:
      • Cost of equipment purchases << cost of ownership

WSC Case Study: Server Provisioning

  WSC Power Capacity                  8.00 MW
  Power Usage Effectiveness (PUE)     1.45
  IT Equipment Power Share            0.67    5.36 MW
  Power/Cooling Infrastructure        0.33    2.64 MW
  IT Equipment Measured Peak (W)      145.00
  Assume Average Power @ 0.8 Peak     116.00
  # of Servers                        46,207

  # of Servers                        46,000
  # of Servers per Rack               40
  # of Racks                          1,150
  Top of Rack (TOR) Switches          1,150
  # of TOR Switches per L2 Switch     16
  # of L2 Switches                    72
  # of L2 Switches per L3 Switch      24
  # of L3 Switches                    3

[Figure: network hierarchy. Servers in a rack connect to a top-of-rack (TOR) switch; TOR switches feed L2 switches, L2 switches feed L3 switches, and the L3 switches connect to the Internet.]

(The arithmetic behind these counts is sketched below.)
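The provisioning numbers follow from simple arithmetic on the figures in the table; a short Python sketch reproducing them:

    power_capacity_w   = 8_000_000      # 8.00 MW facility
    it_power_share     = 0.67           # rest goes to power/cooling infrastructure
    avg_server_power_w = 145.0 * 0.8    # 116 W average (80% of measured peak)

    servers = power_capacity_w * it_power_share / avg_server_power_w
    print(round(servers))               # ~46,207 servers

    servers_provisioned = 46_000                   # rounded down for provisioning
    racks        = servers_provisioned // 40       # 40 servers per rack -> 1,150 racks
    tor_switches = racks                           # one top-of-rack switch per rack
    l2_switches  = round(tor_switches / 16)        # 16 TOR switches per L2 switch -> 72
    l3_switches  = round(l2_switches / 24)         # 24 L2 switches per L3 switch -> 3
    print(racks, tor_switches, l2_switches, l3_switches)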

Cost of WSC

• US accounting practice separates purchase price and operational costs
• Capital Expenditure (CAPEX) is the cost to buy equipment (e.g., buy servers)
• Operational Expenditure (OPEX) is the cost to run the equipment (e.g., pay for the electricity used)

WSC Case Study: Capital Expenditure (CAPEX)

  Facility Cost           $88,000,000
  Total Server Cost       $66,700,000
  Total Network Cost      $12,810,000
  Total Cost             $167,510,000

• Facility cost and total IT cost look about the same
• However, servers are replaced every 3 years, networking gear every 4 years, and the facility every 10 years

Cost of WSC

• US accounting practice allows converting Capital Expenditure (CAPEX) into Operational Expenditure (OPEX) by amortizing costs over a time period (a sketch of the resulting monthly amortization follows below)
   – Servers: 3 years
   – Networking gear: 4 years
   – Facility: 10 years
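Amortizing the CAPEX numbers over those periods gives roughly the monthly figures in the next table. A back-of-the-envelope sketch, assuming straight-line amortization and ignoring financing:

    server_capex   = 66_700_000   # amortized over 3 years
    network_capex  = 12_530_000   # amortized over 4 years
    pwr_cool_capex = 72_160_000   # facility power/cooling, over 10 years
    other_facility = 15_840_000   # rest of the facility, over 10 years

    monthly = (server_capex / (3 * 12) +
               network_capex / (4 * 12) +
               pwr_cool_capex / (10 * 12) +
               other_facility / (10 * 12))
    print(round(monthly))   # ~$2.85M/month; the case study's $3.06M is a bit higher,
                            # presumably because it also charges for the cost of capital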

WSC Case Study: Operational Expense (OPEX)

  Amortized Capital Expense       Years   Amortization   Monthly Cost   % of Total
    Server                            3    $66,700,000     $2,000,000          55%
    Network                           4    $12,530,000       $295,000           8%
    Facility                               $88,000,000
      Power & Cooling                10    $72,160,000       $625,000          17%
      Other                          10    $15,840,000       $140,000           4%
    Amortized Cost                                          $3,060,000

  Operational Expense
    Power (8 MW) @ $0.07/kWh                                  $475,000          13%
    People (3)                                                 $85,000           2%

  Total Monthly                                             $3,620,000         100%

• Monthly power costs:
   – $475k for electricity
   – $625k + $140k to amortize facility power distribution and cooling
   – 60% is amortized power distribution and cooling


How Much Does a Watt Cost in a WSC?

• 8 MW facility
• Amortized facility, including power distribution and cooling, is $625k + $140k = $765k per month
• Monthly power usage = $475k
• Watt-year = ($765k + $475k) × 12 / 8M = $1.86, or about $2 per year (checked in the sketch below)
• To save a watt, if you spend more than $2 a year, you lose money
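The watt-year figure is just the annualized power-related spend divided by the facility's watts; a one-liner check in Python:

    facility_amortization = 765_000    # $/month: $625k power/cooling + $140k other
    electricity           = 475_000    # $/month for electricity (from the OPEX table)
    facility_power_w      = 8_000_000  # 8 MW

    cost_per_watt_year = (facility_amortization + electricity) * 12 / facility_power_w
    print(round(cost_per_watt_year, 2))   # ~$1.86 per watt-year, i.e. about $2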

Replace Rotating Disks with Flash?

• Flash is non-volatile semiconductor memory
   – Costs about $20/GB, capacity about 10 GB
   – Power about 0.01 Watts
• Disk is non-volatile rotating storage
   – Costs about $0.1/GB, capacity about 1000 GB
   – Power about 10 Watts
• Should we replace Disk with Flash to save money? (a back-of-the-envelope comparison follows the answer choices)

A (red)     No: CAPEX costs are 100:1 of OPEX savings!
B (orange)  Don't have enough information to answer the question
C (green)   Yes: Return on investment in a single year!

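Using the per-GB numbers from the question and the roughly $2 per watt-year figure from the previous slide, here is a back-of-the-envelope comparison in Python (an illustration of the trade-off, not an official answer key):

    watt_year_cost = 2.0                  # $ per watt-year (previous slide)

    # Per gigabyte of storage:
    flash_cost_per_gb = 20.0              # $20/GB
    disk_cost_per_gb  = 0.1               # $0.1/GB
    flash_w_per_gb    = 0.01 / 10         # 0.01 W per 10 GB device
    disk_w_per_gb     = 10.0 / 1000       # 10 W per 1000 GB device

    extra_capex  = flash_cost_per_gb - disk_cost_per_gb               # ~$19.9/GB up front
    opex_savings = (disk_w_per_gb - flash_w_per_gb) * watt_year_cost  # ~$0.018/GB per year
    print(extra_capex, opex_savings, extra_capex / opex_savings)
    # The CAPEX premium is two to three orders of magnitude larger than
    # the yearly power savings per GB.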

WSC Case Study: Operational Expense (OPEX)

• $3.8M / 46,000 servers = ~$80 per month per server in revenue to break even
• ~$80 / 720 hours per month = $0.11 per hour
• So how does Amazon EC2 make money???


January 2011 AWS Instances & Prices

• The closest computer in the WSC example is the Standard Extra Large
• At a $0.11/hr cost, Amazon EC2 can make money! (see the break-even check after the table)
   – even if used only 50% of the time

  Instance                            Per Hour   Ratio to   Compute   Virtual   Compute     Memory   Disk   Address
                                                    Small      Units     Cores   Unit/Core     (GB)   (GB)
  Standard Small                        $0.085        1.0        1.0         1        1.00      1.7    160   32 bit
  Standard Large                        $0.340        4.0        4.0         2        2.00      7.5    850   64 bit
  Standard Extra Large                  $0.680        8.0        8.0         4        2.00     15.0   1690   64 bit
  High-Memory Extra Large               $0.500        5.9        6.5         2        3.25     17.1    420   64 bit
  High-Memory Double Extra Large        $1.000       11.8       13.0         4        3.25     34.2    850   64 bit
  High-Memory Quadruple Extra Large     $2.000       23.5       26.0         8        3.25     68.4   1690   64 bit
  High-CPU Medium                       $0.170        2.0        5.0         2        2.50      1.7    350   32 bit
  High-CPU Extra Large                  $0.680        8.0       20.0         8        2.50      7.0   1690   64 bit
  Cluster Quadruple Extra Large         $1.600       18.8       33.5         8        4.20     23.0   1690   64 bit
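Putting the case-study cost and the EC2 price together (a rough check using the OPEX table's monthly total, and ignoring Amazon's other costs such as sales, support, and spare capacity):

    wsc_monthly_cost = 3_620_000        # total monthly OPEX from the case study
    servers          = 46_000
    hours_per_month  = 720

    cost_per_server_hour = wsc_monthly_cost / servers / hours_per_month
    ec2_price_per_hour   = 0.68         # Standard Extra Large, January 2011

    print(round(cost_per_server_hour, 2))        # ~$0.11/hour to break even
    print(round(ec2_price_per_hour * 0.5, 2))    # ~$0.34/hour revenue even at 50% utilization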

Summary

• Request-Level Parallelism
   – High request volume, each request largely independent of the others
   – Use replication for better request throughput and availability
• MapReduce Data Parallelism
   – Divide a large data set into pieces for independent parallel processing
   – Combine and process intermediate results to obtain the final result
• WSC CAPEX vs. OPEX
   – Servers dominate cost
   – Spend more on power distribution and cooling infrastructure than on monthly electricity costs
   – Economies of scale mean a WSC can sell computing as a utility