
CS 61C: Great Ideas in Computer Architecture (Machine Structures)

Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11

Review

• CS 61C: Learn 6 great ideas in computer architecture to enable high-performance programming via parallelism, not just learn C
   1. Layers of Representation/Interpretation
   2. Moore's Law
   3. Principle of Locality/Memory Hierarchy
   4. Parallelism
   5. Performance Measurement and Improvement
   6. Dependability via Redundancy
• Post-PC Era: parallel processing, from smart phones to WSCs
• WSC software must cope with failures, varying load, and varying HW latency and bandwidth
• WSC hardware is sensitive to cost and energy efficiency

New-School Machine Structures (It's a bit more complicated!)

• Parallel Requests: assigned to a computer, e.g., search "Katz"

• Parallel Threads: assigned to a core, e.g., lookup, ads

• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions

• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words

• Hardware descriptions: all gates @ one time

[Figure: the levels at which parallelism is harnessed to achieve high performance, spanning software and hardware: warehouse scale computer and smart phone; computer (cores, memory/cache, main memory, input/output); core (instruction unit(s) and functional unit(s) operating on data such as A0+B0 ... A3+B3); and logic gates. A "Today's Lecture" callout appears in the figure.]

Agenda

• Request- and Data-Level Parallelism
• Administrivia + 61C in the News + Internship Workshop + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)

Request-Level Parallelism (RLP)

• Hundreds or thousands of requests per second
   – Not your laptop or cell phone, but popular Internet services like Google search
   – Such requests are largely independent
      • Mostly involve read-only databases
      • Little read-write (aka "producer-consumer") sharing
      • Rarely involve read-write data sharing or synchronization across requests
• Computation easily partitioned within a request and across different requests (see the sketch below)
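To make the independence concrete, here is a minimal Python sketch (mine, not from the slides) that serves many read-only requests in parallel with a thread pool; the query_index function and the index contents are made-up illustrations.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical read-only "database": no locks needed because requests never write.
    INDEX = {"katz": ["doc1", "doc7"], "patterson": ["doc2", "doc7"]}

    def query_index(term):
        # Each request only reads shared data, so requests are independent.
        return term, INDEX.get(term.lower(), [])

    requests = ["Katz", "Patterson", "Katz", "MapReduce"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for term, docs in pool.map(query_index, requests):
            print(term, "->", docs)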

Google Query-Serving Architecture


Anatomy of a Web Search

• Google "Randy H. Katz"
   – Direct the request to the "closest" Google Warehouse Scale Computer
   – Front-end load balancer directs the request to one of many arrays (clusters of servers) within the WSC
   – Within the array, select one of many Google Web Servers (GWS) to handle the request and compose the response pages
   – GWS communicates with Index Servers to find documents that contain the search words "Randy", "Katz"; uses the location of the search as well
   – Return a document list with associated relevance scores

Anatomy of a Web Search

• In parallel:
   – Ad system: books by Katz at Amazon.com
   – Images of Randy Katz
• Use docids (document IDs) to access indexed documents
• Compose the page
   – Result document extracts (with keyword in context) ordered by relevance score
   – Sponsored links (along the top) and advertisements (along the sides)


Anatomy of a Web Search

• Implementation strategy
   – Randomly distribute the entries
   – Make many copies of the data (aka "replicas")
   – Load balance requests across replicas (see the sketch below)
• Redundant copies of indices and documents
   – Breaks up hot spots, e.g., "Justin Bieber"
   – Increases opportunities for request-level parallelism
   – Makes the system more tolerant of failures
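A minimal sketch (my own, not from the slides) of load-balancing requests across redundant index replicas; the replica names and the random-choice policy are illustrative assumptions.

    import random

    # Hypothetical replicas of the same index shard.
    REPLICAS = ["index-a", "index-b", "index-c"]

    def pick_replica(replicas):
        # Random choice spreads load and breaks up hot spots;
        # a failed replica can simply be dropped from the list.
        return random.choice(replicas)

    for query in ["justin bieber", "randy katz", "justin bieber"]:
        print(query, "->", pick_replica(REPLICAS))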

Data-Level Parallelism (DLP)

• Two kinds
   – Lots of data in memory that can be operated on in parallel (e.g., adding together two arrays)
   – Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• The March 1 lecture and the 3rd project cover Data-Level Parallelism (DLP) in memory (a minimal sketch follows below)
• Today's lecture and the 1st project cover DLP across 1000s of servers and disks using MapReduce
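A tiny sketch of the in-memory kind, using the "adding together two arrays" example: the element-wise additions are independent, so they can be split across worker threads. The chunk size and thread count here are arbitrary illustrations.

    from concurrent.futures import ThreadPoolExecutor

    A = list(range(1000))
    B = list(range(1000, 2000))

    def add_chunk(bounds):
        lo, hi = bounds
        # Each chunk of element-wise additions is independent of the others.
        return [A[i] + B[i] for i in range(lo, hi)]

    chunks = [(i, i + 250) for i in range(0, len(A), 250)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        C = [x for part in pool.map(add_chunk, chunks) for x in part]

    assert C[0] == 1000 and C[-1] == 999 + 1999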

Problem Trying To Solve

• How do we process large amounts of raw data (crawled documents, request logs, ...) every day to compute derived data (inverted indices, page popularity, ...), when the computation is conceptually simple but the input data is large and distributed across 100s to 1000s of servers, so that we finish in reasonable time?
• Challenge: parallelize the computation, distribute the data, and tolerate faults without obscuring the simple computation with complex code to deal with these issues
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, Jan 2008.


MapReduce Solution

• Apply a Map function to each user-supplied record of key/value pairs
• Compute a set of intermediate key/value pairs
• Apply a Reduce operation to all values that share the same key in order to combine the derived data properly
• The user supplies the Map and Reduce operations in a functional model, so the runtime can parallelize them, using re-execution for fault tolerance

Data-Parallel "Divide and Conquer" (MapReduce Processing)

• Map:
   – Slice data into "shards" or "splits", distribute these to workers, compute sub-problem solutions
   – map(in_key, in_value) -> list(out_key, intermediate_value)
      • Processes an input key/value pair
      • Produces a set of intermediate pairs
• Reduce:
   – Collect and combine sub-problem solutions
   – reduce(out_key, list(intermediate_value)) -> list(out_value)
      • Combines all intermediate values for a particular key
      • Produces a set of merged output values (usually just one)
• Fun to use: focus on the problem, let the MapReduce library deal with the messy details (a minimal sketch of this model follows below)
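Here is a minimal, single-machine Python sketch of the functional model just described (an illustration, not Google's implementation): the "library" groups intermediate values by key between the user's map and reduce functions.

    from collections import defaultdict

    def map_reduce(inputs, map_fn, reduce_fn):
        # "Map phase": the user's map_fn emits (key, value) pairs for each input record.
        intermediate = defaultdict(list)
        for in_key, in_value in inputs:
            for out_key, value in map_fn(in_key, in_value):
                intermediate[out_key].append(value)
        # "Reduce phase": the user's reduce_fn combines all values that share a key.
        return {key: reduce_fn(key, values) for key, values in intermediate.items()}

    # Example: word count expressed in this model.
    docs = [("d1", "that that is"), ("d2", "is that")]
    counts = map_reduce(docs,
                        map_fn=lambda k, text: [(w, 1) for w in text.split()],
                        reduce_fn=lambda k, vals: sum(vals))
    print(counts)   # {'that': 3, 'is': 2}

Because both user functions are side-effect free, the library is free to run many map and reduce calls in parallel and to simply re-run any call that fails.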

MapReduce Execution

Fine-granularity tasks: many more map tasks than machines
2,000 servers => ≈ 200,000 Map Tasks, ≈ 5,000 Reduce tasks

MapReduce Popularity at Google

                                       Aug-04    Mar-06     Sep-07     Sep-09
  Number of MapReduce jobs             29,000   171,000  2,217,000  3,467,000
  Average completion time (secs)          634       874        395        475
  Server years used                       217     2,002     11,081     25,562
  Input data read (TB)                  3,288    52,254    403,152    544,130
  Intermediate data (TB)                  758     6,743     34,774     90,120
  Output data written (TB)                193     2,970     14,018     57,520
  Average number of servers per job       157       268        394        488

Google Uses MapReduce For ...

• Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document
• Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
• Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
• Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
• Fault-tolerant parallel execution of programs written in higher-level languages across a collection of input data
• More than 10,000 MR programs at Google in 4 years

What if the Google Workload Ran on EC2?

                                       Aug-04    Mar-06     Sep-07     Sep-09
  Number of MapReduce jobs             29,000   171,000  2,217,000  3,467,000
  Average completion time (secs)          634       874        395        475
  Server years used                       217     2,002     11,081     25,562
  Input data read (TB)                  3,288    52,254    403,152    544,130
  Intermediate data (TB)                  758     6,743     34,774     90,120
  Output data written (TB)                193     2,970     14,018     57,520
  Average number of servers per job       157       268        394        488
  Average cost/job on EC2                 $17       $39        $26        $38
  Annual cost if on EC2                 $0.5M     $6.7M     $57.4M    $133.1M


Agenda

• Request- and Data-Level Parallelism
• Administrivia + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)

Course Information

• Course Web: http://inst.eecs.Berkeley.edu/~cs61c/sp11
• Instructors:
   – Randy Katz, Dave Patterson
• Teaching Assistants:
   – Andrew Gearhart, Conor Hughes, Yunsup Lee, Ari Rabkin, Charles Reiss, Andrew Waterman, Vasily Volkov
• Textbooks: average 15 pages of reading/week
   – Patterson & Hennessy, Computer Organization and Design, 4th Edition (not ≤3rd Edition, not the Asian version of the 4th edition)
   – Kernighan & Ritchie, The C Programming Language, 2nd Edition
   – Barroso & Holzle, The Datacenter as a Computer, 1st Edition
• Google Groups:
   – 61CSpring2011UCB-announce: announcements from staff
   – 61CSpring2011UCB-disc: Q&A, discussion by anyone in 61C
   – Email Andrew Gearhart [email protected] to join

This Week

• Discussions and labs will be held this week
   – Switching sections: if you find another 61C student willing to swap discussion AND lab, talk to your TAs
   – Partners (only project 3 and extra credit): OK if partners mix sections but have the same TA
• First homework assignment due this Sunday, January 23rd, by 11:59:59 PM
   – There is a reading assignment as well on the course page

Course Organization

• Grading
   – Participation and Altruism (5%)
   – Homework (5%)
   – Labs (20%)
   – Projects (40%)
      1. Data Parallelism (MapReduce on Amazon EC2)
      2. Computer Instruction Set Simulator (C)
      3. Performance Tuning of a Parallel Application/Matrix Multiply using cache blocking, SIMD, MIMD (OpenMP; done with a partner)
      4. Computer Processor Design (Logisim)
   – Extra Credit: Matrix Multiply Competition, anything goes
   – Midterm (10%): 6-9 PM, Tuesday, March 8
   – Final (20%): 11:30-2:30 PM, Monday, May 9

EECS Grading Policy

• http://www.eecs.berkeley.edu/Policies/ugrad.grading.shtml
   "A typical GPA for courses in the lower division is 2.7. This GPA would result, for example, from 17% A's, 50% B's, 20% C's, 10% D's, and 3% F's. A class whose GPA falls outside the range 2.5-2.9 should be considered atypical."
• Fall 2010: GPA 2.81; 26% A's, 47% B's, 17% C's, 3% D's, 6% F's
• Job/Intern interviews: they grill you with technical questions, so it's what you say, not your GPA
   (The new 61C gives you good stuff to say)

          Fall   Spring
  2010    2.81   2.81
  2009    2.71   2.81
  2008    2.95   2.74
  2007    2.67   2.76

Late Policy

• Assignments due Sundays at 11:59:59 PM
• Late homeworks not accepted (100% penalty)
• Late projects get a 20% penalty, accepted up to Tuesdays at 11:59:59 PM
   – No credit if more than 48 hours late
   – No "slip days" in 61C
      • Used by Dan Garcia and a few faculty to cope with 100s of students who often procrastinate without having to hear the excuses, but not widespread in EECS courses
      • More late assignments if everyone has no-cost options; better to learn now how to cope with real deadlines


Policy on Assignments and Independent Work

• With the exception of laboratories and assignments that explicitly permit you to work in groups, all homeworks and projects are to be YOUR work and your work ALONE.
• You are encouraged to discuss your assignments with other students, and extra credit will be assigned to students who help others, particularly by answering questions on the Google Group, but we expect that what you hand in is yours.
• It is NOT acceptable to copy solutions from other students.
• It is NOT acceptable to copy (or start your solutions from) the Web.
• We have tools and methods, developed over many years, for detecting this. You WILL be caught, and the penalties WILL be severe.
• At a minimum, a ZERO for the assignment, possibly an F in the course, and a letter to your university record documenting the incidence of cheating.
• (We caught people last semester!)

YOUR BRAIN ON COMPUTERS; Hooked on Gadgets, and Paying a Mental Price

NY Times, June 7, 2010, by Matt Richtel

SAN FRANCISCO -- When one of the most important e-mail messages of his life landed in his in-box a few years ago, Kord Campbell overlooked it.

Not just for a day or two, but 12 days. He finally saw it while sifting through old messages: a big company wanted to buy his Internet start-up. "I stood up from my desk and said, 'Oh my God, oh my God, oh my God,'" Mr. Campbell said. "It's kind of hard to miss an e-mail like that, but I did."

The message had slipped by him amid an electronic flood: two computer screens alive with e-mail, instant messages, online chats, a Web browser and the computer code he was writing.

While he managed to salvage the $1.3 million deal after apologizing to his suitor, Mr. Campbell continues to struggle with the effects of the deluge of data. Even after he unplugs, he craves the stimulation he gets from his electronic gadgets. He forgets things like dinner plans, and he has trouble focusing on his family. His wife, Brenda, complains, "It seems like he can no longer be fully in the moment."

This is your brain on computers.

Scientists say juggling e-mail, phone calls and other incoming information can change how people think and behave. They say our ability to focus is being undermined by bursts of information. These play to a primitive impulse to respond to immediate opportunities and threats. The stimulation provokes excitement -- a dopamine squirt -- that researchers say can be addictive. In its absence, people feel bored.

The resulting distractions can have deadly consequences, as when cellphone-wielding drivers and train engineers cause wrecks. And for millions of people like Mr. Campbell, these urges can inflict nicks and cuts on creativity and deep thought, interrupting work and family life.

While many people say multitasking makes them more productive, research shows otherwise. Heavy multitaskers actually have more trouble focusing and shutting out irrelevant information, scientists say, and they experience more stress.

And scientists are discovering that even after the multitasking ends, fractured thinking and lack of focus persist. In other words, this is also your brain off computers.

The Rules (and we really mean it!)

Architecture of a Lecture

[Figure: attention (y-axis) vs. time in minutes (x-axis, ticks at 0, 20, 25, 50, 53, 78, 80), annotated "Full", "Administrivia" (20-25 min), "Tech break" (50-53 min), and "And in conclusion..." (78-80 min).]

61C in the News

• IEEE Spectrum Top 11 Innovations of the Decade

[Figure: the magazine's Top 11 list, with "61C" callouts marking the innovations touched on in this course.]

Raffle for prizes! Special thanks to the companies donating prizes:
Linear Technologies T-shirts; Microsoft Windows 7 & Office 2010; Yelp T-shirts & beanie; WhitePages 1-yr subscription plus a mobile look-up app for iPhone or Android; and other prizes donated by Autodesk and Facebook.

Participating companies: Adobe, Aerospace Corp., Agilent Technologies, Amazon.com, Apple, Inc., Arista Networks, Autodesk, Inc., CBS Interactive, Cisco Systems, Citadel, eBay, EMC2, Ericsson, Facebook, Google, Hewlett Packard, Intel, KLA Tencor, Lab 126, Linear Technologies, Lockheed Martin, Microsoft, National Instruments, NetApp, Nvidia, Oracle, Palantir Technologies, Pixar, Redfin, Riverbed, salesforce.com, Shoretel, Splunk, Starkey, Sybase, Tibco, Twitter, VMware, WhitePages, Yahoo!, Yelp, Zynga.

EECS Internship Openhouse 2011: Summer Internship Opportunities
Industrial Relations Office
Monday, January 24, 11:00 a.m. to 2:00 p.m.
Pauley Ballroom, UC Berkeley (MLK Jr. Student Union, Bancroft Ave. at Telegraph)


The Secret to Getting Good Grades

• A grad student said he finally figured it out
   – (Mike Dahlin, now a Professor at UT Austin)
• My question: What is the secret?
• Do the assigned reading the night before, so that you get more value from the lecture
• Fall 61C comment on the end-of-semester survey: "I wish I had followed Professor Patterson's advice and did the reading before each lecture."

MapReduce Processing Example: Count Word Occurrences

• Pseudo code, for the simple case of assuming just 1 word per document (a runnable Python version follows the pseudocode):

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1"); // Produce count of words

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v); // get integer from key-value
      Emit(AsString(result));

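The same word-count logic written as runnable Python (my illustration, not the production API): the returned lists and the in-memory grouping stand in for EmitIntermediate/Emit and for the shuffle.

    from collections import defaultdict

    def map_fn(input_key, input_value):
        # input_key: document name; input_value: document contents
        return [(w, "1") for w in input_value.split()]

    def reduce_fn(output_key, intermediate_values):
        # output_key: a word; intermediate_values: a list of counts as strings
        return str(sum(int(v) for v in intermediate_values))

    docs = {"doc1": "that that is is that", "doc2": "that is not"}

    intermediate = defaultdict(list)
    for name, contents in docs.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)      # stands in for EmitIntermediate

    result = {w: reduce_fn(w, vals) for w, vals in intermediate.items()}
    print(result)   # {'that': '4', 'is': '3', 'not': '1'}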

MapReduce Processing

[Figure: overview of MapReduce execution: input splits feed map tasks, a shuffle phase routes intermediate key/value pairs to reduce tasks, and the reduce tasks write the output files. The following steps walk through this figure.]

1. MapReduce first splits the input files into M "splits", then starts many copies of the program on the servers.

2. One copy, the master, is special. The rest are workers. The master picks idle workers and assigns each one of the M map tasks or one of the R reduce tasks.

3. A map worker reads its input split, parses key/value pairs out of the input data, and passes each pair to the user-defined map function. (The intermediate key/value pairs produced by the map function are buffered in memory.)


4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.

5. When a reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together. (The sorting is needed because typically many different keys map to the same reduce task.)

6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. The MapReduce call in the user program returns back to user code. The output of MR is in R output files (one per reduce task, with file names specified by the user); it is often passed into another MR job. (A compact sketch of the partition, sort, and group steps follows below.)
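A compact Python sketch of steps 4-6 (partitioning into R regions, then sorting and grouping by key on the reduce side); the hash partitioner, R = 2, and the word-count reduce function are illustrative choices, not the real implementation.

    from itertools import groupby

    R = 2  # number of reduce tasks (illustrative)
    intermediate = [("that", 1), ("is", 1), ("that", 1), ("not", 1), ("is", 1)]

    # Step 4: partition buffered pairs into R regions with a partitioning function.
    regions = [[] for _ in range(R)]
    for key, value in intermediate:
        regions[hash(key) % R].append((key, value))

    # Steps 5-6: each reduce task sorts its region by key, groups equal keys,
    # and hands each key plus its values to the reduce function (here: sum).
    for r, region in enumerate(regions):
        region.sort(key=lambda kv: kv[0])
        for key, group in groupby(region, key=lambda kv: kv[0]):
            print("reduce task", r, ":", key, "->", sum(v for _, v in group))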

MapReduce Processing Time Line

• Master assigns map and reduce tasks to "worker" servers
• As soon as a map task finishes, the worker server can be assigned a new map or reduce task
• Data shuffle begins as soon as a given Map finishes
• Reduce task begins as soon as all data shuffles finish
• To tolerate faults, reassign a task if a worker server "dies"

Show MapReduce Job Running

• ~41 minutes total
   – ~29 minutes for Map tasks & Shuffle tasks
   – ~12 minutes for Reduce tasks
   – 1,707 worker servers used
• Map (green) tasks read 0.8 TB, write 0.5 TB
• Shuffle (red) tasks read 0.5 TB, write 0.5 TB
• Reduce (blue) tasks read 0.5 TB, write 0.5 TB


MapReduce Processing: Execution

[Figure: the stages Map, Shuffle, Reduce/Combine, and Collect Results. Map key k1 hashes to Reduce Task 2, k2 to Reduce Task 1, k3 to Reduce Task 2, k4 to Reduce Task 1, and k5 to Reduce Task 1.]


Another Example: Word Index (How Often Does a Word Appear?)

Input, distributed across four map tasks:
  "that that is" | "is that that" | "is not is not" | "is that it it is"

Map outputs:
  Map 1: is 1, that 2    Map 2: is 1, that 2    Map 3: is 2, not 2    Map 4: is 2, it 2, that 1

Shuffle (group intermediate pairs by key):
  Reduce 1 receives: is 1,1 and is 2,2 (-> is 1,1,2,2), plus it 2
  Reduce 2 receives: that 2,2 and that 1 (-> that 2,2,1), plus not 2

Reduce outputs:
  Reduce 1: is 6; it 2    Reduce 2: not 2; that 5

Collect: is 6; it 2; not 2; that 5

(A quick Python check of this trace follows below.)
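A quick Python check of the trace above (an illustration; the four-way split mirrors the figure, and Counter does the per-split counting and the final combine).

    from collections import Counter

    splits = ["that that is", "is that that", "is not is not", "is that it it is"]

    # Map: count words within each split.
    map_outputs = [Counter(s.split()) for s in splits]

    # Shuffle + Reduce: combine the per-split counts by key.
    total = Counter()
    for m in map_outputs:
        total.update(m)

    print(dict(total))   # {'that': 5, 'is': 6, 'not': 2, 'it': 2}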

MapReduce Failure Handling

• On worker failure:
   – Detect failure via periodic heartbeats
   – Re-execute completed and in-progress map tasks
   – Re-execute in-progress reduce tasks
   – Task completion committed through the master
• Master failure:
   – Could handle, but don't yet (master failure unlikely)
• Robust: lost 1,600 of 1,800 machines once, but finished fine

MapReduce Redundant Execution

• Slow workers significantly lengthen completion time
   – Other jobs consuming resources on the machine
   – Bad disks with soft errors transfer data very slowly
   – Weird things: processor caches disabled (!!)
• Solution: near the end of a phase, spawn backup copies of tasks
   – Whichever one finishes first "wins" (see the sketch below)
• Effect: dramatically shortens job completion time
   – 3% more resources, large tasks 30% faster
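A toy sketch of the backup-task idea (mine, not the real scheduler): launch a duplicate of a straggling task and take whichever copy finishes first. The simulated delays are made up.

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_task(copy_name):
        # Simulate a task whose speed varies (e.g., a slow disk on one machine).
        time.sleep(random.uniform(0.05, 0.5))
        return copy_name

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_task, "primary"), pool.submit(run_task, "backup")]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        print("winner:", next(iter(done)).result())
        # The loser's result is simply ignored, as in the real system.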

Impact on Execution of Restart and Failure for a 10B-record Sort using 1,800 servers

MapReduce Locality Optimization

• Master scheduling policy:
   – Asks GFS (Google File System) for the locations of the replicas of the input file blocks
   – Map tasks typically split into 64 MB chunks (== GFS block size)
   – Map tasks scheduled so a GFS input block replica is on the same machine or the same rack (see the sketch below)
• Effect: thousands of machines read input at local disk speed
• Without this, rack switches limit the read rate
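A minimal sketch of that scheduling preference (my own illustration, with made-up cluster metadata): prefer a worker that already holds a replica of the block, then a worker in the same rack, then anything.

    # Hypothetical cluster metadata: which rack each worker is in,
    # and which workers hold a replica of the input block.
    rack_of = {"w1": "rackA", "w2": "rackA", "w3": "rackB"}
    replica_hosts = {"w3"}          # GFS says the block lives on w3
    idle_workers = ["w1", "w2", "w3"]

    def pick_worker(idle, replicas, racks):
        # 1) a worker that already has the data on local disk
        for w in idle:
            if w in replicas:
                return w, "local disk"
        # 2) a worker in the same rack as some replica
        replica_racks = {racks[r] for r in replicas}
        for w in idle:
            if racks[w] in replica_racks:
                return w, "same rack"
        # 3) otherwise anyone (data then crosses rack switches)
        return idle[0], "remote"

    print(pick_worker(idle_workers, replica_hosts, rack_of))   # ('w3', 'local disk')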

Agenda

• Request- and Data-Level Parallelism
• Administrivia + The secret to getting good grades at Berkeley
• MapReduce
• MapReduce Examples
• Technology Break
• Costs in Warehouse Scale Computer (if time permits)


Design Goals of a WSC

• Unique to warehouse scale
   – Ample parallelism:
      • Batch apps: a large number of independent data sets with independent processing. Also known as Data-Level Parallelism
   – Scale and its opportunities/problems:
      • The relatively small number of these makes design cost expensive and difficult to amortize
      • But price breaks are possible from purchases of very large numbers of commodity servers
      • Must also prepare for high component failures
   – Operational costs count:
      • Cost of equipment purchases << cost of ownership

WSC Case Study: Server Provisioning

  WSC Power Capacity                  8.00 MW
  Power Usage Effectiveness (PUE)     1.45
  IT Equipment Power Share            0.67    5.36 MW
  Power/Cooling Infrastructure        0.33    2.64 MW
  IT Equipment Measured Peak (W)      145.00
  Assume Average Power @ 0.8 Peak     116.00
  # of Servers                        46,207

  # of Servers                        46,000
  # of Servers per Rack               40
  # of Racks                          1,150
  Top of Rack (TOR) Switches          1,150
  # of TOR Switches per L2 Switch     16
  # of L2 Switches                    72
  # of L2 Switches per L3 Switch      24
  # of L3 Switches                    3

[Figure: network hierarchy. Servers in a rack connect to a top-of-rack (TOR) switch; TOR switches feed L2 switches, L2 switches feed L3 switches, and the L3 switches connect to the Internet.]

(The arithmetic behind these counts is sketched below.)
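The provisioning numbers follow from simple arithmetic on the figures in the table; a short Python sketch reproducing them:

    power_capacity_w   = 8_000_000      # 8.00 MW facility
    it_power_share     = 0.67           # rest goes to power/cooling infrastructure
    avg_server_power_w = 145.0 * 0.8    # 116 W average (80% of measured peak)

    servers = power_capacity_w * it_power_share / avg_server_power_w
    print(round(servers))               # ~46,207 servers

    servers_provisioned = 46_000                   # rounded down for provisioning
    racks        = servers_provisioned // 40       # 40 servers per rack -> 1,150 racks
    tor_switches = racks                           # one top-of-rack switch per rack
    l2_switches  = round(tor_switches / 16)        # 16 TOR switches per L2 switch -> 72
    l3_switches  = round(l2_switches / 24)         # 24 L2 switches per L3 switch -> 3
    print(racks, tor_switches, l2_switches, l3_switches)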

Cost of WSC

• US accounting practice separates purchase price and operational costs
• Capital Expenditure (CAPEX) is the cost to buy equipment (e.g., buy servers)
• Operational Expenditure (OPEX) is the cost to run the equipment (e.g., pay for the electricity used)

WSC Case Study: Capital Expenditure (CAPEX)

  Facility Cost           $88,000,000
  Total Server Cost       $66,700,000
  Total Network Cost      $12,810,000
  Total Cost             $167,510,000

• Facility cost and total IT cost look about the same
• However, servers are replaced every 3 years, networking gear every 4 years, and the facility every 10 years

Cost of WSC

• US accounting practice allows converting Capital Expenditure (CAPEX) into Operational Expenditure (OPEX) by amortizing costs over a time period (a sketch of the resulting monthly amortization follows below)
   – Servers: 3 years
   – Networking gear: 4 years
   – Facility: 10 years
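Amortizing the CAPEX numbers over those periods gives roughly the monthly figures in the next table. A back-of-the-envelope sketch, assuming straight-line amortization and ignoring financing:

    server_capex   = 66_700_000   # amortized over 3 years
    network_capex  = 12_530_000   # amortized over 4 years
    pwr_cool_capex = 72_160_000   # facility power/cooling, over 10 years
    other_facility = 15_840_000   # rest of the facility, over 10 years

    monthly = (server_capex / (3 * 12) +
               network_capex / (4 * 12) +
               pwr_cool_capex / (10 * 12) +
               other_facility / (10 * 12))
    print(round(monthly))   # ~$2.85M/month; the case study's $3.06M is a bit higher,
                            # presumably because it also charges for the cost of capital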

WSC Case Study: Operational Expense (OPEX)

  Amortized Capital Expense       Years   Amortization   Monthly Cost   % of Total
    Server                            3    $66,700,000     $2,000,000          55%
    Network                           4    $12,530,000       $295,000           8%
    Facility                               $88,000,000
      Power & Cooling                10    $72,160,000       $625,000          17%
      Other                          10    $15,840,000       $140,000           4%
    Amortized Cost                                          $3,060,000

  Operational Expense
    Power (8 MW) @ $0.07/kWh                                  $475,000          13%
    People (3)                                                 $85,000           2%

  Total Monthly                                             $3,620,000         100%

• Monthly power costs:
   – $475k for electricity
   – $625k + $140k to amortize facility power distribution and cooling
   – 60% is amortized power distribution and cooling


How Much Does a Watt Cost in a WSC?

• 8 MW facility
• Amortized facility, including power distribution and cooling, is $625k + $140k = $765k per month
• Monthly power usage = $475k
• Watt-year = ($765k + $475k) × 12 / 8M = $1.86, or about $2 per year (checked in the sketch below)
• To save a watt, if you spend more than $2 a year, you lose money
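The watt-year figure is just the annualized power-related spend divided by the facility's watts; a one-liner check in Python:

    facility_amortization = 765_000    # $/month: $625k power/cooling + $140k other
    electricity           = 475_000    # $/month for electricity (from the OPEX table)
    facility_power_w      = 8_000_000  # 8 MW

    cost_per_watt_year = (facility_amortization + electricity) * 12 / facility_power_w
    print(round(cost_per_watt_year, 2))   # ~$1.86 per watt-year, i.e. about $2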

Replace Rotating Disks with Flash?

• Flash is non-volatile semiconductor memory
   – Costs about $20/GB, capacity about 10 GB
   – Power about 0.01 Watts
• Disk is non-volatile rotating storage
   – Costs about $0.1/GB, capacity about 1000 GB
   – Power about 10 Watts
• Should we replace Disk with Flash to save money? (a back-of-the-envelope comparison follows the answer choices)

A (red)     No: CAPEX costs are 100:1 of OPEX savings!
B (orange)  Don't have enough information to answer the question
C (green)   Yes: Return on investment in a single year!

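Using the per-GB numbers from the question and the roughly $2 per watt-year figure from the previous slide, here is a back-of-the-envelope comparison in Python (an illustration of the trade-off, not an official answer key):

    watt_year_cost = 2.0                  # $ per watt-year (previous slide)

    # Per gigabyte of storage:
    flash_cost_per_gb = 20.0              # $20/GB
    disk_cost_per_gb  = 0.1               # $0.1/GB
    flash_w_per_gb    = 0.01 / 10         # 0.01 W per 10 GB device
    disk_w_per_gb     = 10.0 / 1000       # 10 W per 1000 GB device

    extra_capex  = flash_cost_per_gb - disk_cost_per_gb               # ~$19.9/GB up front
    opex_savings = (disk_w_per_gb - flash_w_per_gb) * watt_year_cost  # ~$0.018/GB per year
    print(extra_capex, opex_savings, extra_capex / opex_savings)
    # The CAPEX premium is two to three orders of magnitude larger than
    # the yearly power savings per GB.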

WSC Case Study: Operational Expense (OPEX)

• $3.8M / 46,000 servers = ~$80 per month per server in revenue to break even
• ~$80 / 720 hours per month = $0.11 per hour
• So how does Amazon EC2 make money???


January 2011 AWS Instances & Prices

• The closest computer in the WSC example is the Standard Extra Large
• At a $0.11/hr cost, Amazon EC2 can make money! (see the break-even check after the table)
   – even if used only 50% of the time

  Instance                            Per Hour   Ratio to   Compute   Virtual   Compute     Memory   Disk   Address
                                                    Small      Units     Cores   Unit/Core     (GB)   (GB)
  Standard Small                        $0.085        1.0        1.0         1        1.00      1.7    160   32 bit
  Standard Large                        $0.340        4.0        4.0         2        2.00      7.5    850   64 bit
  Standard Extra Large                  $0.680        8.0        8.0         4        2.00     15.0   1690   64 bit
  High-Memory Extra Large               $0.500        5.9        6.5         2        3.25     17.1    420   64 bit
  High-Memory Double Extra Large        $1.000       11.8       13.0         4        3.25     34.2    850   64 bit
  High-Memory Quadruple Extra Large     $2.000       23.5       26.0         8        3.25     68.4   1690   64 bit
  High-CPU Medium                       $0.170        2.0        5.0         2        2.50      1.7    350   32 bit
  High-CPU Extra Large                  $0.680        8.0       20.0         8        2.50      7.0   1690   64 bit
  Cluster Quadruple Extra Large         $1.600       18.8       33.5         8        4.20     23.0   1690   64 bit
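Putting the case-study cost and the EC2 price together (a rough check using the OPEX table's monthly total, and ignoring Amazon's other costs such as sales, support, and spare capacity):

    wsc_monthly_cost = 3_620_000        # total monthly OPEX from the case study
    servers          = 46_000
    hours_per_month  = 720

    cost_per_server_hour = wsc_monthly_cost / servers / hours_per_month
    ec2_price_per_hour   = 0.68         # Standard Extra Large, January 2011

    print(round(cost_per_server_hour, 2))        # ~$0.11/hour to break even
    print(round(ec2_price_per_hour * 0.5, 2))    # ~$0.34/hour revenue even at 50% utilization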

Summary

• Request-Level Parallelism
   – High request volume, each request largely independent of the others
   – Use replication for better request throughput and availability
• MapReduce Data Parallelism
   – Divide a large data set into pieces for independent parallel processing
   – Combine and process intermediate results to obtain the final result
• WSC CAPEX vs. OPEX
   – Servers dominate cost
   – Spend more on power distribution and cooling infrastructure than on monthly electricity costs
   – Economies of scale mean a WSC can sell computing as a utility