
Multi-core Computing

Z. Huang
University of Otago
New Zealand

COSC440


Multi-core age (1)


[Timeline figure, 1980 to 2010: from single processors to symmetrical multi-processors (SMP) and then chip multi-threading (CMT)]


Multi-core age (2)

•  Physical limits for chip manufacturing
   –  Light speed, heat dissipation, power consumption
•  Moore's law is still true!
   –  Moore's Law: the number of transistors that can be placed on a chip doubles approximately every two years
   –  A silicon die can accommodate more and more transistors
•  New direction
   –  To build multiple (hundreds or thousands of) less powerful cores (CPUs) into a single chip


Memory wall

•  The memory wall is the growing disparity between CPU speed and RAM speed
   –  From 1986 to 2000, CPU speed improved at 55% annually while memory speed improved at only 10%.
   –  Now the execution of one instruction takes only a few cycles, but a memory access outside the chip takes hundreds of cycles.
   –  90% of CPU time is spent waiting for data from RAM


Multi-core architectures (1)

•  Single core


[Figure: a single core with its own interrupt logic, CPU state, execution units, and cache]


Multi-core architectures (2)

•  Multiprocessor (SMP)


[Figure: two separate processors (SMP), each with its own interrupt logic, CPU state, execution units, and cache]


Multi-core architectures (3)
•  Simultaneous Multithreading (SMT)
   –  Also called Hyper-Threading Technology (HTT), Intel's trademarked name
   –  SMT uses multiple simultaneous hardware threads (aka strands) to utilize stalling time, such as cache misses and branch mispredictions, in a processor.
   –  Can help hide the memory wall.
   –  E.g. Intel Xeon, Pentium 4


[Figure: an SMT processor with two sets of interrupt logic and CPU state (one per hardware thread) sharing a single set of execution units and a single cache]


Multi-core architectures (4)

•  Multi-core (CMP, Chip Multi-Processing)


[Figure: two cores on one chip (CMP), each with its own interrupt logic, CPU state, execution units, and cache]


Multi-core architectures (5)
•  Multi-core with shared cache (CMP)
•  E.g. Intel Core Duo, AMD Barcelona (Opteron 8300)


[Figure: two cores, each with its own interrupt logic, CPU state, execution units, and private cache, sharing an L2/L3 cache]


Multi-core architectures (6)
•  Multi-core with SMT and shared L2 cache (Sun Niagara; CMT, Chip Multi-Threading)


[Figure: two SMT cores; each core has multiple sets of interrupt logic and CPU state sharing that core's execution units and private cache, and the two cores share an L2 cache]


CMP, SMT and CMT

•  CMP + SMT = CMT


[Figure: block diagrams of CMP, SMT, and CMT side by side]


Parallel computing
•  Parallel computing, computing with more than one processor, has been around for more than 50 years now.


Gill writes in 1958: "... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at the cost of considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation ..." Gill, S. (1958), "Parallel Programming," The Computer Journal, vol. 1, April, pp. 2-10.


What  is  different  now?  

•  Parallel computers are easily accessible
   –  Cluster computers (Beowulf clusters)
   –  Networks of Workstations (NOW)
   –  Grid computing resources
   –  Multi-core
•  No free ride on CPU speed. CPU speed is limited by
   –  The heat dissipation problem, and
   –  Eventually the speed of light.


Speedup  

•  The performance of a parallel program is mainly measured with speedup
   –  Speedup = Time of sequential program / Time of parallel program


Amdahl's Law

•  Proposed by Gene Amdahl in 1967
•  Speedup = n / (1 + (n - 1)f)
   –  where f is the fraction of the computation that cannot be parallelized in a parallel algorithm, and n is the number of processors working on the concurrent parts of the algorithm.
•  What does Amdahl's Law tell us?
   –  Even with an infinite number of processors, the maximum speedup is bounded by 1/f
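To make the bound concrete, here is a minimal C sketch (an illustration added here, not part of the original slides) that evaluates the formula for a serial fraction f = 0.1:

    #include <stdio.h>

    /* Amdahl's Law: speedup = n / (1 + (n - 1) * f), where f is the
     * serial (non-parallelizable) fraction and n is the number of
     * processors. */
    static double amdahl_speedup(double f, double n)
    {
        return n / (1.0 + (n - 1.0) * f);
    }

    int main(void)
    {
        double f = 0.1;
        int n_values[] = {2, 8, 64, 1024};
        for (int i = 0; i < 4; i++)
            printf("n = %4d: speedup = %.2f\n",
                   n_values[i], amdahl_speedup(f, n_values[i]));
        return 0;
    }

With f = 0.1 the speedup is about 1.82 for n = 2 and 4.71 for n = 8, and it can never exceed the 1/f = 10 bound.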


Gustafson's Law
•  Amdahl's Law does not take into account problem size

•  John L. Gustafson proposed the following formula in 1988
   –  Speedup = n - α(n - 1)
   –  where n is the number of processors, α is the fraction of the parallel execution time spent in the non-parallelizable part, and β is the fraction spent in the parallel part (α + β = 1, the normalized parallel execution time)
   –  When α diminishes with problem size, the speedup approaches n
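As a worked illustration (not on the original slide): with n = 64 processors and α = 0.05, the scaled speedup is 64 - 0.05 × 63 = 60.85; if a larger problem drives α down to 0.01, the speedup becomes 64 - 0.63 = 63.37, approaching n.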


A metaphor
•  Suppose a person is traveling between two cities 10 km apart, and has already spent one hour walking half the distance at 5 km/h.
•  Amdahl's Law suggests:
   –  "No matter how fast the person runs the last half, it is impossible to achieve an average of 20 km/h before reaching the second city. Since one hour has already passed and the total distance is only 10 km, even running infinitely fast would only achieve an average of 10 km/h."
•  Gustafson's Law argues:
   –  "Given enough time and distance to walk, the person's average speed can always eventually reach 20 km/h, no matter how long the person has already walked. In the two-cities case this could be achieved by running at 25 km/h for an additional three hours, i.e. by extending the journey well beyond the second city."


Counter-Amdahl's Law
•  Suppose f is the fraction of the computation that is serial in a parallel algorithm, and p is the speedup for the parallel fraction of the algorithm. Then, if p > (1 - f)/f, which means p is greater than the ratio between the parallel fraction and the serial fraction, it is more efficient to improve the serial fraction rather than the parallel fraction in order to increase the overall speedup of the algorithm.

•  For a detailed proof, refer to
   –  Z. Huang, et al., "Virtual Aggregated Processor in Multi-core Computers," NZHPC Workshop, in Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies 2008 (PDCAT '08), IEEE Computer Society, Dunedin, 2008.


Modern programming models
•  SIMD (Single Instruction, Multiple Data)
   –  The same instruction is executed at any time by many cores, but on different data
   –  E.g. GPUs/GPGPU, with languages like CUDA and OpenCL
•  SPMD (Single Program, Multiple Data)
   –  Multiple processors simultaneously execute the same program, but on different data sets (see the sketch after this list).
   –  Unlike SIMD, there is no lockstep at the instruction level, so it can be implemented on general-purpose processors
   –  E.g. MPI 1.0 and VOPP
•  MPMD (Multiple Program, Multiple Data)
   –  Multiple processors simultaneously execute different programs on different data sets.
   –  Can spawn new processes/threads during execution.
   –  E.g. OpenMP, PVM, and MPI 2.0
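A rough SPMD sketch in C with MPI (an illustration, not from the slides; it assumes an MPI installation): every process runs the same program and uses its rank to choose its share of the data.

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        /* SPMD data decomposition: partition the range [0, N) by rank. */
        int chunk = N / size;
        int begin = rank * chunk;
        int end   = (rank == size - 1) ? N : begin + chunk;

        double local_sum = 0.0;
        for (int i = begin; i < end; i++)
            local_sum += (double)i;

        /* Combine the partial results on rank 0. */
        double total = 0.0;
        MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %.0f\n", total);

        MPI_Finalize();
        return 0;
    }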


Parallelization methods (1)
•  Data decomposition
   –  Breaks down a task by the data it works on
•  Example
   –  Matrix addition A + B = C (see the sketch after the figure)


[Figure: A, B, and C are each drawn as a 4x4 grid of blocks (A1..A16, B1..B16, C1..C16) with A + B = C; the addition is decomposed so that different blocks can be computed in parallel]
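A minimal sketch of this data decomposition in C with OpenMP (an illustration; the slides do not prescribe any particular API). Each element of C is independent, so the rows can simply be divided among the threads:

    #include <omp.h>

    #define N 1024

    /* Data decomposition: C = A + B, with the rows of C shared out
     * among the threads by the parallel-for directive. */
    void matrix_add(const double A[N][N], const double B[N][N], double C[N][N])
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
    }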


Parallelization methods (2)
•  Task decomposition
   –  Decompose a task by the functions it performs.
•  Example
   –  Gardening decomposed into mowing, weeding, and trimming
   –  The task of a simulated soccer game can be divided into 22 sub-tasks, each of which simulates one player.


Parallelization methods (3)
•  Data flow decomposition
   –  Decompose a task by how data flows between different stages.
•  Example
   –  Producer/consumer problem, pipelining (see the sketch after the figure)


[Figure: a four-stage pipeline F1 -> F2 -> F3 -> F4]
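A minimal producer/consumer sketch with POSIX threads (an illustration under the usual bounded-buffer assumptions, not code from the slides): the producer stage passes items to the consumer stage through a shared queue.

    #include <pthread.h>
    #include <stdio.h>

    #define SIZE 8
    #define ITEMS 100

    /* Bounded buffer connecting the producer stage to the consumer stage. */
    static int buf[SIZE];
    static int count = 0, in = 0, out = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        for (int i = 0; i < ITEMS; i++) {
            pthread_mutex_lock(&lock);
            while (count == SIZE)                 /* wait for free space */
                pthread_cond_wait(&not_full, &lock);
            buf[in] = i;
            in = (in + 1) % SIZE;
            count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        for (int i = 0; i < ITEMS; i++) {
            pthread_mutex_lock(&lock);
            while (count == 0)                    /* wait for data */
                pthread_cond_wait(&not_empty, &lock);
            int item = buf[out];
            out = (out + 1) % SIZE;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
            printf("consumed %d\n", item);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }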


Parallelization methods (4)
•  Divide and Conquer
   –  Decompose a task into sub-tasks dynamically during execution. Often task queues are used.
•  Example
   –  Mergesort


[Figure: a task tree in which task N1 is split into sub-tasks N2 and N3, and N2 is further split into N4 and N5]


Multithreading

•  Threads are divided into
   –  Heavy-weight, traditionally called processes
   –  Light-weight, also called light-weight processes (LWT)
•  The light-weight threads
   –  Unlike processes, which have individual address spaces, they share the same address space and the following:
      •  process instructions, heap, open files (descriptors), signal handlers, current working directory, user and group IDs
   –  However, a LWT has its own:
      •  thread ID, set of registers (including program counter and stack pointer), stack (for local variables and return addresses), signal mask, priority


Shared memory

•  No matter whether LWT or HWT are used, if a shared memory space is used, we need to deal with the data race issue
•  Race conditions
   –  Determinacy race: a race condition that occurs when two threads access the same memory location and at least one of the threads performs a write.
   –  Data race: a determinacy race that occurs without mutual exclusion mechanisms protecting the involved memory location.


Data race (1)

•  Data races cause bugs in programs.
•  Example
   –  Suppose two threads execute the same statement, and X is 0 initially


T1               T2
X = X + 1        X = X + 1
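A minimal POSIX threads sketch of this race (an illustration, not code from the slides): with many iterations the final value of X is usually less than expected, because increments are lost when the loads and stores of the two threads interleave.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000

    static long X = 0;                 /* shared and unprotected */

    /* Both threads run the same unsynchronized increment. */
    static void *worker(void *arg)
    {
        for (long i = 0; i < ITERS; i++)
            X = X + 1;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expected 2 * ITERS, but the data race typically yields less. */
        printf("X = %ld (expected %d)\n", X, 2 * ITERS);
        return 0;
    }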


Data race (2)
•  In machine language, the two threads will execute the following instructions:


T1                   T2
LOAD  X, R1          LOAD  X, R2
INC   R1             INC   R2
STORE R1, X          STORE R2, X


Data race (3)
•  In a concurrent environment, there is no guarantee about the order in which the instructions of the two threads are executed
•  Suppose the instructions are executed in the following order:


T1                               T2
LOAD  X, R1    (R1 is 0)
-- T1 is suspended for some reason --
                                 LOAD  X, R2    (R2 is 0)
                                 INC   R2       (R2 is 1)
                                 STORE R2, X    (X is 1)
INC   R1       (R1 is 1)
STORE R1, X    (X is 1)

The final result of X is NOT what we expected!


Solution for data race

•  The problem in the previous example is the violation of the atomicity of X = X + 1
   –  The execution should be uninterruptible by competitors.
   –  Atomicity means the set of operations should have the all-or-nothing property: they should all be completed, or none of them done, before other competing operations start.
•  There are two solutions for guaranteeing atomicity
   –  Mutual exclusion (pessimistic approach)
   –  Transactional memory (optimistic approach)


Mutual exclusion
•  Atomic variables
   –  Use the hardware-supported CAS (compare-and-swap) instruction
•  Semaphore
   –  An integer variable S (≥ 0) which is only accessed by two atomic operations: down (P, test and decrease) and up (V, increase)
   –  If S is 0, down has to wait.
•  Mutex
   –  When a semaphore is initialized to 1, it is called a mutex, which allows only one thread to get in with down at any time
   –  It is generally called a lock, which is the only interesting application of a semaphore.
   –  The code section protected by a mutex through down and up is often called a critical section (see the sketch after this list)
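A minimal sketch of both approaches to making the earlier X = X + 1 atomic (an illustration only; it assumes POSIX threads and the GCC/Clang atomic builtins):

    #include <pthread.h>

    static long X = 0;
    static pthread_mutex_t X_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Mutual exclusion with a mutex: lock/unlock play the roles of
     * down/up, and the increment becomes a critical section. */
    void increment_with_mutex(void)
    {
        pthread_mutex_lock(&X_lock);
        X = X + 1;                      /* critical section */
        pthread_mutex_unlock(&X_lock);
    }

    /* Atomic variable: the whole read-modify-write is performed
     * atomically by the hardware (e.g. via CAS or an atomic add). */
    void increment_with_atomic(void)
    {
        __atomic_fetch_add(&X, 1, __ATOMIC_SEQ_CST);
    }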


Deadlock

•  When locks are nested, deadlock can happen (see the sketch after the figure).
   –  Deadlock is a situation where threads are waiting for locks that will never be released.


T1                                    T2
Acquire lock 1   -- acquired          Acquire lock 2   -- acquired
Acquire lock 2   -- waits forever     Acquire lock 1   -- waits forever
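The same scenario as a C sketch with POSIX threads (an illustration, not from the slides); acquiring nested locks in one agreed global order is the usual way to avoid it:

    #include <pthread.h>

    static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

    /* T1 takes lock1 then lock2 ... */
    void *t1(void *arg)
    {
        pthread_mutex_lock(&lock1);
        pthread_mutex_lock(&lock2);   /* may wait forever if T2 holds lock2 */
        /* ... work ... */
        pthread_mutex_unlock(&lock2);
        pthread_mutex_unlock(&lock1);
        return NULL;
    }

    /* ... while T2 takes them in the opposite order, so each thread can
     * end up holding one lock while waiting for the other. */
    void *t2(void *arg)
    {
        pthread_mutex_lock(&lock2);
        pthread_mutex_lock(&lock1);   /* may wait forever if T1 holds lock1 */
        /* ... work ... */
        pthread_mutex_unlock(&lock1);
        pthread_mutex_unlock(&lock2);
        return NULL;
    }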


Synchronization (1)
•  A barrier is often used to synchronize threads.
   –  When a barrier is called in a thread, the thread is blocked until all other threads arrive at the barrier.
   –  A similar mechanism, join, is used in Pthreads
•  A barrier can be used to avoid data races as well (see the sketch after the figure).


[Figure: an array A split into a red part and a blue part; T1 works on the red part while T2 works on the blue part, both call the barrier, and after the barrier T1 works on the blue part while T2 works on the red part]
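A minimal sketch of the red/blue scheme with a POSIX barrier (an illustration only; the two halves of the array stand in for the red and blue parts):

    #include <pthread.h>

    #define N 1024
    static double A[N];
    static pthread_barrier_t barrier;       /* initialized for 2 threads */

    /* Each thread fills its own half of A, waits at the barrier, and only
     * then touches the half the other thread wrote, so there is no race. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;                /* 0 or 1 */
        int half = N / 2;
        int mine = id * half;
        int other = (1 - id) * half;

        for (int i = mine; i < mine + half; i++)
            A[i] = i;                        /* phase 1: own half */

        pthread_barrier_wait(&barrier);      /* wait for the other thread */

        for (int i = other; i < other + half; i++)
            A[i] += 1.0;                     /* phase 2: the other half */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int id0 = 0, id1 = 1;
        pthread_barrier_init(&barrier, NULL, 2);
        pthread_create(&t1, NULL, worker, &id0);
        pthread_create(&t2, NULL, worker, &id1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }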


Synchronization (2)
•  In message-passing based programming, messages are used for synchronization.


T1                    T2
Send_to(T2)           Recv_from(T1)

T2 is blocked in Recv_from(T1) until the message from T1 arrives.


Determinacy race
•  Even with locks, there exists a kind of non-determinacy in parallel programs called a determinacy race, which causes the non-determinate results of parallel programs


Suppose X is 0 initially.

T1                      T2
Acquire lock 1          Acquire lock 1
X = X + 100             printf("The value of X is %d\n", X);
Release lock 1          Release lock 1

The printed result of X is not determined: it depends on which thread acquires the lock first!


Processor optimizations
•  Memory ordering is often optimized for performance reasons
   –  For example, two writes W1 and W2 from P1 can be seen by P1 in the order W1 -> W2, but by P2 in the order W2 -> W1 (which is allowed by Partial Store Order, PSO, in SPARC)
   –  In Intel processors, if P1 writes W1 and P2 writes W2, it is possible for P1 to observe W1 -> W2 while P2 observes W2 -> W1
•  Read operations can overtake previous write operations waiting to be completed in the write buffer.
•  These optimizations can cause problems for programmers assuming Sequential Consistency.


Sequential Consistency
•  SC was defined by Lamport in 1979
   –  The result of any execution is the same as if the operations of all processes were executed in some global sequential order, and the operations of each individual process appear in this sequence in the order specified by its own program
   –  Example:


Initially X = Y = 0;

P1             P2
X = 1;         Y = 2;
T1 = Y;        T2 = X;

What are the possible values of T1 and T2?

T1    T2
 2     1     possible
 2     0     possible
 0     1     possible
 0     0     not possible under SC (it would require T1 = Y to run before Y = 2 and T2 = X to run before X = 1, which contradicts the program order of each process)


Example 1


Initially X = Y = 0;

T1               T2                    T3
X = 1;           while (X != 1) ;      while (Y != 1) ;
                 Y = 1;                tmp = X;

What value can T3 get from X? The expected value is 1, but it may get 0.

Suppose T1, T2, and T3 run on C1, C2, and C3 respectively. Since the updates on X and Y can be seen by C3 in the order Y -> X, it is possible for T3 to see Y as 1 and execute tmp = X before the update X = 1 arrives.


Example 2


Initially flag1 and flag2 are set to 0.

T1                           T2
flag1 = 1;                   flag2 = 1;
if (flag2 == 0)              if (flag1 == 0)
    critical section;            critical section;

The critical section should be executed by at most one thread at any time. However, if a read operation can overtake write operations, the critical section may be executed by both threads at the same time. A possible order: T2 reads flag1 (0); T1 writes flag1 (1); T1 reads flag2 (0); T2 writes flag2 (1).


Solution: fence
•  A fence instruction (called a barrier in Linux) is often used to avoid the previous problems.
•  At execution time, a fence instruction guarantees the completion of all pre-fence memory operations and delays all post-fence memory operations until the fence instruction itself completes.
•  Fences have performance costs; Intel provides multiple kinds of fences to relieve the problem.
•  The following gives an example of using a fence:


T1                           T2
flag1 = 1;                   flag2 = 1;
fence;                       fence;
if (flag2 == 0)              if (flag1 == 0)
    critical section;            critical section;
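In portable C this pattern can be written with C11 atomics (a sketch assuming a C11 compiler; the slides use generic pseudocode rather than a particular API). T2 is symmetric with the two flags swapped:

    #include <stdatomic.h>

    static atomic_int flag1 = 0, flag2 = 0;

    /* T1's side of the handshake. */
    void t1_enter(void)
    {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);
        /* Full fence: the store to flag1 is ordered before the load of
         * flag2, so the two threads cannot both read 0 and both enter. */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0) {
            /* critical section */
        }
    }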


Cache coherence (1)


[Figure: four cores C1..C4 connected to memory over a bus, each with its own cache; C3's cache holds the new value of A, while the caches of C1, C2, and C4 and the memory still hold the old value of A]


Cache coherence (2)

•  A cache coherence protocol is used to propagate a write to the other copies
•  There are two basic protocols
   –  Invalidate: remove the old copies from other caches
   –  Update: update the old copies in other caches to the new value
•  A cache line is used as the basic unit for invalidating or updating caches.
•  Normally an invalidate protocol is used in processors


False sharing
•  False sharing occurs when two independent variables are located in the same cache line
   –  A cache line in modern processors can be 64 bytes or more
•  False sharing can cause unnecessary performance loss
•  Suppose eight threads work in parallel on an array A[8] whose elements happen to be in the same cache line.


T1             T2             ...    T8
A[0] = 0;      A[1] = 1;             A[7] = 7;

The performance will be the same as if these statements were executed serially, since each time a thread has to own the cache line before the write access. Even worse, each write invalidates the copies of the cache line in the other caches. Solution: padding!
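A sketch of the padding solution in C (an illustration; the 64-byte line size is an assumption about the target processor):

    #define CACHE_LINE 64                /* assumed cache line size */

    /* Without padding: all eight elements share one cache line, so each
     * write forces the other cores' copies of the line to be invalidated. */
    long A_shared[8];

    /* With padding: each element sits in its own cache line, so the eight
     * threads no longer falsely share a line. */
    struct padded {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };
    struct padded A_padded[8];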


Priority inversion (1)
•  Locking can cause priority inversion in real-time systems
   –  Priority inversion is a situation where a low-priority task holds a lock that is required by a high-priority task, which causes the high-priority task to be blocked until the low-priority task has released the lock, thus inverting the priorities of the two tasks.
•  A typical priority inversion was experienced by the Mars lander "Mars Pathfinder".
   –  There were three tasks:
      •  Data gathering task (DGT): low priority, infrequent, short running
      •  Bus management task (BMT): high priority, frequent
      •  Communications task (CT): medium priority, infrequent, long running
   –  DGT and BMT share an information bus and are synchronized with a mutex.


Priority inversion (2)
•  The following scenario caused frequent resets of the Pathfinder:


   –  Bus management task: blocked by the data gathering task, which is holding the lock
   –  Data gathering task: preempted by the communications task, which is long running
   –  Communications task: enjoying the CPU cycles
•  Eventually a watchdog timer would go off, notice that the bus management task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.


Transactional memory
•  TM is a good solution to deadlock and priority inversion, though it still has the data race issue and the livelock problem.
•  The idea is to treat shared memory operations as transactions, which can be committed, or aborted and rolled back for execution again.
   –  Operation conflicts such as W/W and R/W need to be detected by the conflict manager in order to commit or abort a transaction.


T1                            T2
Begin_transaction;            Begin_transaction;
X = 1;                        Y = 1;
if (Y == 0) {}                if (X == 0) {}
End_transaction;              End_transaction;
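A rough C sketch of the same pattern using GCC's transactional memory extension (an assumption on my part; the slides only use generic Begin_transaction/End_transaction pseudocode). It needs to be compiled with -fgnu-tm, and T2 is symmetric with X and Y swapped:

    /* Compile with: gcc -fgnu-tm tm_example.c */
    static int X = 0, Y = 0;

    /* T1's transaction. The TM runtime detects W/W and R/W conflicts with
     * concurrent transactions and aborts and re-executes one of them. */
    void t1_transaction(void)
    {
        __transaction_atomic {
            X = 1;
            if (Y == 0) {
                /* only seen if T2's transaction has not committed */
            }
        }
    }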


Livelock
•  If a memory location is heavily contended, it is possible for transactions to abort each other, which causes livelock
•  A livelock is a situation where transactions keep aborting each other, resulting in no useful progress at all for any transaction.


[Figure: transactions T1 and T2 abort each other; each rolls back and retries, and the cycle repeats]