HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu



Page 1:

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks

Kevin Kai-Wei Chang    Rachata Ausavarungnirun

Chris Fallin    Onur Mutlu

Page 2:

Executive Summary

Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance

Approach: Source throttling (temporarily delaying packet injections) to reduce congestion

1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling

2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment

Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Page 3:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 4:

On-Chip Networks

• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable

[Figure: processing elements (PE: cores, L2 banks, memory controllers, etc.) connected by an interconnect]

Page 5:

On-Chip Networks

• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
• Packet switched
• 2D mesh: the most commonly used topology
• Primarily serve cache misses and memory requests

[Figure: 3x3 2D mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]

Page 6:

Network Congestion Reduces Performance

• Limited shared resources (buffers and links), due to design constraints: power, chip area, and timing

Network congestion → lower system performance

[Figure: 2D mesh in which packets (P) contend for routers (R) and links, creating congestion]

Page 7:

Goal

• Improve system performance in a highly congested network

• Observation: Reducing network load (the number of packets in the network) decreases network congestion and hence improves system performance

• Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Page 8:

Source Throttling

Long network latency when the network is congested

[Figure: processing elements (PE) injecting packets (P) into a congested network through injection ports (I)]

Page 9:

Source Throttling

• Throttling makes some packets wait longer to inject
• Average network throughput increases, hence higher system performance

[Figure: the same network with some sources throttled, so fewer packets contend inside the network]

Page 10:

Key Questions of Source Throttling

• Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
  – We call P the "throttling rate" (it ranges from 0 to 1; see the sketch below)
• The throttling rate can be set independently on every node

Two key questions:
1. Which applications to throttle?
2. How much to throttle?

Naïve mechanism: throttle every single node with a constant throttling rate
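As a concrete illustration, here is a minimal sketch of that per-cycle injection gate in Python (the function and parameter names are hypothetical; the slides do not prescribe an implementation):

```python
import random

def may_inject(throttling_rate: float) -> bool:
    """Per-cycle injection gate at one node.

    A packet that is ready to inject is blocked with probability
    `throttling_rate` (P, in [0, 1]); a blocked packet simply
    retries on a later cycle.
    """
    return random.random() >= throttling_rate
```

With P = 0 the node injects freely; with P = 1 injection is fully blocked, which is how the "Self-Tuned" comparison point throttles.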

Page 11:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

Throttling gromacs decreases system performance by 2% due to minimal network load reduction

Page 12:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion

Page 13:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

• Throttling mcf reduces congestion; gromacs benefits more from the lower network latency
• Throttling network-intensive applications reduces congestion, which benefits both network-non-intensive and network-intensive applications

Page 14:

Key Observation #2

There is no single throttling rate that works well for every application workload

The network runs best at or below a certain network load, so the throttling rate must be adjusted dynamically to avoid both overload and under-utilization

[Figure: every node throttled at a fixed rate of 0.8, illustrating that one static rate cannot fit all workloads]

Page 15:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 16:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 17:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 18:

Application-Aware Throttling

1. Measure applications' network intensity
   – Use L1 MPKI (misses per thousand instructions) to estimate network intensity

2. Throttle network-intensive applications
   – How to select the unthrottled applications? Leaving too many applications unthrottled overloads the network
   → Select unthrottled applications so that their total network intensity (Σ MPKI) stays below a threshold representing the total network capacity; applications with higher L1 MPKI are throttled (a sketch of this classification follows below)

[Figure: applications ordered by L1 MPKI; the low-MPKI group with Σ MPKI < threshold is left unthrottled (network-non-intensive), the rest are throttled (network-intensive)]
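A minimal sketch of that selection step, assuming a greedy pass from least to most network-intensive (the slides specify only the Σ MPKI < threshold condition):

```python
def classify_unthrottled(mpki_per_app: dict[str, float],
                         capacity_threshold: float) -> set[str]:
    """Select the set of applications left unthrottled.

    Admit applications in increasing order of L1 MPKI while their
    summed MPKI stays below a threshold standing in for the total
    network capacity; everything else gets throttled. The greedy
    low-MPKI-first order is an illustrative assumption.
    """
    unthrottled: set[str] = set()
    total_mpki = 0.0
    for app, mpki in sorted(mpki_per_app.items(), key=lambda kv: kv[1]):
        if total_mpki + mpki >= capacity_threshold:
            break
        unthrottled.add(app)
        total_mpki += mpki
    return unthrottled
```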

Page 19:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 20:

Dynamic Throttling Rate Adjustment

• Different workloads require different throttling rates to avoid overloading the network
• But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
• Key idea: Measure the current network load and dynamically adjust the throttling rate based on it:

  if network load > target:
      increase throttling rate    (network is congested: throttle more)
  else:
      decrease throttling rate    (network is not congested: avoid unnecessary throttling)
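A minimal sketch of one such update, assuming an additive step and clamping to [0, 1] (the slides do not give the exact update rule):

```python
def adjust_throttling_rate(rate: float, network_load: float,
                           target_load: float, step: float = 0.05) -> float:
    """One load-feedback update of the throttling rate.

    The additive `step` and the clamping to [0, 1] are illustrative
    assumptions; only the increase/decrease decision comes from the slide.
    """
    if network_load > target_load:
        rate += step  # congested: throttle more
    else:
        rate -= step  # not congested: avoid unnecessary throttling
    return min(1.0, max(0.0, rate))
```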

Page 21:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 22:

Epoch-Based Operation

• Application classification and throttling rate adjustment are expensive if done every cycle
• Solution: recompute at epoch granularity (100K-cycle epochs)

During an epoch, every node:
1) Measures its L1 MPKI
2) Measures the network load

At the beginning of each epoch, all nodes send the measured information to a central controller, which:
1) Classifies the applications
2) Adjusts the throttling rate
3) Sends the new classification and throttling rate to each node

(A sketch of this epoch loop follows below.)
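Putting the two previous sketches together, one epoch boundary might look like this (the node and policy objects are hypothetical stand-ins for the hardware; only the control flow mirrors the slide):

```python
EPOCH_CYCLES = 100_000  # epoch length used on the slide

def epoch_boundary(nodes, policy):
    """Central-controller work at the start of each epoch.

    Each node reports its measured L1 MPKI and network load; the
    controller reclassifies applications, updates the throttling rate,
    and pushes the new policy back out for the next epoch.
    """
    mpki = {node.app_id: node.measured_mpki for node in nodes}
    avg_load = sum(node.measured_load for node in nodes) / len(nodes)

    unthrottled = classify_unthrottled(mpki, policy.capacity_threshold)
    policy.rate = adjust_throttling_rate(policy.rate, avg_load,
                                         policy.target_load)

    for node in nodes:  # distribute classification and rate to every node
        node.set_policy(throttle=(node.app_id not in unthrottled),
                        rate=policy.rate)
```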

Page 23:

Putting It Together: Key Contributions

1. Application-aware throttling
   – Throttle network-intensive applications based on the applications' network intensities

2. Network-load-aware throttling rate adjustment
   – Dynamically adjust the throttling rate based on network load to avoid overloading the network

HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment

Page 24:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 25:

Comparison Points

• Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
  – Throttles network-intensive applications when other applications cannot inject
  – Does not take network load into account
  – We call this "Heterogeneous Throttling"

• Source throttling for buffered networks [Thottethodi+ HPCA'01]
  – Throttles every application when the network load exceeds a dynamically tuned threshold
  – Not application-aware
  – Fully blocks packet injections while throttling
  – We call this "Self-Tuned Throttling"

Page 26:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 27:

Methodology

• Chip Multiprocessor Simulator
  – 64-node multi-core systems with a 2D-mesh topology
  – Closed-loop core/cache/NoC cycle-level model
  – 64KB L1, perfect L2 (always hits, to stress the NoC)

• Router Designs
  – Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
    • Input buffers hold contending packets
  – Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
    • Misroutes (deflects) contending packets

Page 28:

Methodology

• Workloads
  – 60 multi-core workloads of SPEC CPU2006 benchmarks
  – 4 workload categories based on the network intensity of the applications (Low/Medium/High)

• Metrics
  – System performance: $\text{Weighted Speedup} = \sum_i \frac{IPC_i^{shared}}{IPC_i^{alone}}$
  – Fairness: $\text{Maximum Slowdown} = \max_i \frac{IPC_i^{alone}}{IPC_i^{shared}}$
  – Energy efficiency: $\text{PerfPerWatt} = \frac{\text{Weighted Speedup}}{\text{Power}}$
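For concreteness, the three metrics written as straightforward Python functions over per-application IPC values (names are illustrative):

```python
def weighted_speedup(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """System performance: sum over applications of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """Unfairness: the worst per-application slowdown, IPC_alone / IPC_shared."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))

def perf_per_watt(weighted_speedup_value: float, power_watts: float) -> float:
    """Energy efficiency: weighted speedup per watt of network power."""
    return weighted_speedup_value / power_watts
```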

Page 29:

Performance: Bufferless NoC (BLESS)

[Figure: weighted speedup (0-50) for workload categories HL, HML, HM, H, and AVG under No Throttling, Heterogeneous, and HAT, with improvement callouts of +7.4%, +9.4%, +11.5%, +11.3%, and +8.6%]

1. HAT provides a better performance improvement than state-of-the-art throttling approaches
2. The highest improvement is on heterogeneous workloads, since L and M applications are more sensitive to network latency

Page 30:

Performance: Buffered NoC

[Figure: weighted speedup (0-50) for workload categories HL, HML, HM, H, and AVG under No Throttling, Self-Tuned, Heterogeneous, and HAT, with a +3.5% improvement callout]

HAT provides a better performance improvement than prior approaches

Page 31:

Application Fairness

[Figure: normalized maximum slowdown (lower is better) for BLESS (No Throttling, Heterogeneous, HAT) and for the buffered network (No Throttling, Self-Tuned, Heterogeneous, HAT), with -15% and -5% callouts for HAT]

HAT provides better fairness than prior works

Page 32:

Network Energy Efficiency

[Figure: normalized performance per watt for BLESS and Buffered under No Throttling and HAT; HAT gains +8.5% (BLESS) and +5% (Buffered)]

HAT increases energy efficiency by reducing network load by 4.7% (BLESS) or 0.5% (Buffered)

Page 33:

Other Results in the Paper

• Performance on CHIPPER [Fallin+ HPCA'11]
  – HAT improves system performance

• Performance on multithreaded workloads
  – HAT is not designed for multithreaded workloads, but it slightly improves system performance

• Parameter sensitivity sweep of HAT
  – HAT provides consistent system performance improvement across different network sizes

Page 34:

Conclusions

Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance

Approach: Source throttling (temporarily delaying packet injections) to reduce congestion

1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling

2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment

Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Page 35:

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks

Kevin Kai-Wei Chang    Rachata Ausavarungnirun

Chris Fallin    Onur Mutlu

Page 36:

Throttling Rate Steps

Page 37:

Network Sizes

[Figure: weighted speedup (higher is better) and unfairness (lower is better) across network sizes of 16, 64, and 144 nodes for VC-buffered, BLESS, and CHIPPER routers, each with and without HAT; HAT improves performance at every network size, with callouts ranging from about 1.3% to 20.0%]

Page 38:

Performance on CHIPPER/BLESS

[Figure: weighted speedup (higher is better) and unfairness (lower is better) for workload categories HL, HML, HM, H, and AVG, comparing CHIPPER, CHIPPER-HETERO., CHIPPER-HAT, BLESS, BLESS-HETERO., and BLESS-HAT; improvement callouts range from about 5.4% to 25.6%]

Injection rate (number of injected packets / cycle): +8.5% (BLESS) or +4.8% (CHIPPER)

Page 39:

Multithreaded Workloads

Page 40:

Motivation

[Figure: weighted speedup (6-16) vs. throttling rate (80%-100%) for two workloads; Workload 1 peaks near a 90% throttling rate and Workload 2 near 94%, so the best rate is workload-dependent]

Page 41:

Sensitivity to Other Parameters

• Network load target: WS peaks between 50% and 65% network utilization
  – WS drops by more than 10% beyond that
• Epoch length: WS varies by less than 1% between 20K and 1M cycles
• Low-intensity workloads: HAT does not impact system performance
• Unthrottled network intensity threshold:

Page 42:

Implementation

• L1 MPKI: one L1 miss hardware counter and one instruction hardware counter per node
• Network load: one hardware counter to monitor the number of flits routed to neighboring nodes
• Computation: done at one central CPU node
  – At most several thousand cycles (a few percent overhead for a 100K-cycle epoch)

(A sketch of the per-node counter bookkeeping follows below.)
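A minimal software model of those per-node counters (field and method names are illustrative; the slide only specifies which counters exist):

```python
from dataclasses import dataclass

@dataclass
class NodeCounters:
    """Per-node hardware counters from the slide, modeled in software."""
    l1_misses: int = 0
    instructions: int = 0
    flits_routed: int = 0  # flits forwarded to neighboring nodes

    def l1_mpki(self) -> float:
        """L1 misses per thousand instructions for the current epoch."""
        return 1000.0 * self.l1_misses / max(1, self.instructions)

    def reset(self) -> None:
        """Clear counters at each epoch boundary."""
        self.l1_misses = self.instructions = self.flits_routed = 0
```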