
Architectural Alternatives for Energy Efficient Performance Scalingcasl.gatech.edu/.../2013/01/VLSID_2013_yalamanchili.pdf · 2013-01-30 · Architectural Alternatives for Energy


26th International Conference on VLSI, January 2013, Pune, India

 

Architectural Alternatives for Energy Efficient Performance Scaling

Sudhakar Yalamanchili

School of Electrical and Computer Engineering Georgia Institute of Technology

Outline

• Impending Power and Thermal Limits to Multicore

• New Rules: Scaling Performance

• Application: Co-Design of a Multicore Architecture and Thread Scheduler


Moore’s Law


From wikipedia.org

•  Performance scaled with number of transistors

•  Dennard scaling: power scaled with feature size

Goal: Sustain Performance Scaling

New Rules: The End of Dennard Scaling

[Figure: MOSFET cross-section labeled with gate, source, drain, channel length L, and oxide thickness tox]

• Change in scaling factors

• Slower scaling in power per transistor → increasing power densities

From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
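The link between slower power scaling and rising power density can be illustrated with a short sketch. It assumes the textbook dynamic-power model P = C·V²·f and a linear shrink factor `k` per generation; the function name and the value of `k` are illustrative and do not come from the talk.

```python
def scale_power_density(k, voltage_scales=True):
    """Return power density after a shrink by factor k, relative to before.

    Assumes dynamic power per transistor P = C * V^2 * f (textbook model).
    Ideal Dennard scaling: C -> C/k, V -> V/k, f -> f*k, area -> area/k^2.
    With voltage_scales=False the supply voltage is held constant, which is
    one simplified way to model the end of Dennard scaling.
    """
    capacitance = 1.0 / k
    voltage = 1.0 / k if voltage_scales else 1.0
    frequency = k
    area = 1.0 / k ** 2
    power = capacitance * voltage ** 2 * frequency
    return power / area


if __name__ == "__main__":
    k = 1.4  # illustrative shrink factor, an assumption
    print("ideal Dennard scaling:", scale_power_density(k))
    print("constant voltage:     ", scale_power_density(k, voltage_scales=False))
```

Under these assumptions the ideal case keeps power density constant, while holding voltage fixed raises it by a factor of k² per generation.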


Impending Power and Thermal Limits for Multicore

Mukhopadhyay and Yalamanchili (2009)

• Based on scaling using Pentium-class cores modeled with IntSim¹

• Chip-level performance will be power and thermally limited!

¹ D. Sekar, A. Naeemi, R. Sarvari, J. Davis, and J. Meindl, “IntSim: A CAD tool for optimization of multilevel interconnect networks,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2007.

Impending Power and Thermal Limits for Multicore

[Figure: performance versus year. One curve is labeled "Predicted by Moore’s Law" and another "Limited by power/thermal"; the gap between them is labeled "Dark Silicon Gap".]


Managing the Physics

[Figure: relative CPU and GPU power and peak die temperature versus time (seconds). Series: GPU Pow, CPU CU0 Pow, CPU CU1 Pow, PeakDieTemp. Annotations: "CPU power is limited, GPU running at max DVFS state"; "Thermal coupling"; "Temp throttling". Die map: CU0, CU1, GPU.]

• Thermal coupling between CPU and GPU accelerates temperature rise

• Induces premature throttling

AMD Trinity APU

Paul, Manne, Bircher, & Yalamanchili 2012

Outline

• Impending Power and Thermal Limits to Multicore

• New Rules: Scaling Performance

• Application: Co-Design of a Multicore Architecture and Thread Scheduler


Post Dennard Architecture Performance Scaling

$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$

W. J. Dally, Keynote IITC 2012

Operator_cost + Data_movement_cost

Three operands x 64 bits/operand

$$\text{Energy} = \#\text{bits} \times \text{dist-mm} \times \text{energy-bit-mm}$$

Borkar & Chien, 2011
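The data movement term can be evaluated directly for the three-operand case on this slide. The sketch below is only an illustration: the per-bit-per-mm energy and the distances are placeholder values chosen for the example, not figures from the talk or from Borkar & Chien.

```python
def data_movement_energy_pj(num_bits, distance_mm, energy_pj_per_bit_mm):
    """Energy = #bits x distance (mm) x energy per bit per mm, in picojoules."""
    return num_bits * distance_mm * energy_pj_per_bit_mm


OPERANDS = 3           # from the slide: three operands
BITS_PER_OPERAND = 64  # from the slide: 64 bits per operand

if __name__ == "__main__":
    bits = OPERANDS * BITS_PER_OPERAND
    assumed_pj_per_bit_mm = 0.1  # placeholder value, an assumption
    for distance_mm in (1.0, 10.0):  # placeholder wire lengths
        energy = data_movement_energy_pj(bits, distance_mm, assumed_pj_per_bit_mm)
        print(f"{bits} bits over {distance_mm} mm: {energy:.1f} pJ")
```

Because the cost grows linearly with distance, the same operation becomes more expensive the farther its operands have to travel, independent of the operator cost.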

Scaling Performance: Cost of Data Movement

Embedded Platforms. Goal: 1-100 GOps/W

Big Science: To Exascale. Goal: 20 MW/Exaflop

• Sustain performance scaling through massive concurrency

• Data movement becomes more expensive than computation

[Figure: Cost of Data Movement. Courtesy: Sandia National Labs, R. Murphy.]
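The 20 MW/Exaflop goal can be turned into an energy budget per operation using the relation Perf = Power × Efficiency. This is a small arithmetic sketch; the function names are illustrative.

```python
def required_efficiency_ops_per_joule(perf_ops_per_s, power_w):
    """Rearranged from Perf = Power x Efficiency."""
    return perf_ops_per_s / power_w


def energy_per_op_pj(efficiency_ops_per_joule):
    """Energy available per operation, in picojoules."""
    return 1e12 / efficiency_ops_per_joule


if __name__ == "__main__":
    exaflop = 1e18       # operations per second
    power_budget = 20e6  # 20 MW, the goal stated on the slide
    eff = required_efficiency_ops_per_joule(exaflop, power_budget)
    print(f"required efficiency: {eff / 1e9:.0f} GOps/W")
    print(f"energy budget per op: {energy_per_op_pj(eff):.0f} pJ")
```

The budget works out to 50 GOps/W, or 20 pJ per operation including all data movement, which lies inside the 1-100 GOps/W range quoted for embedded platforms.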


Post Dennard Architecture Performance Scaling

$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$

W. J. Dally, Keynote IITC 2012

Operator_cost + Data_movement_cost

Three operands x 64 bits/operand

Specialization → heterogeneity and asymmetry

$$\text{Energy} = \#\text{bits} \times \text{dist-mm} \times \text{energy-bit-mm}$$

Borkar & Chien, 2011

Hardware Power-Performance Tradeoffs

[Figure: GOps/Watt versus programmability/flexibility for ASIC, FPGA (Xilinx Virtex 6), DSP (LP) (TMS320671D), GPU (NVIDIA Tesla), in-order processor (Atom), and OOO processor (Westmere-EP).]

Customization is key to energy efficiency!

[Images: Model T. Credits: freecaroffers.net, freewebs.com]


Computer Architecture Today: Speculation and Locality

Locality of reference (spatial/temporal)

Predictable control flow

Latency Hiding

Frequency Scaling

    add  r1, r2, r3
    sub  r4, r1, r5
    slt  r6, r8, r9
    bne  r6, exit
    muld f0, f2, f4
    ..
    ..

Speculative execution

Speculative fetch

Speculative pre-fetch

Correctness check on speculation

You Cannot Hide Energy!

Scaling Performance: Simplify and Multiply

AMD Bulldozer Core

ARM A7 Core (arm.com)

• Extracting single-thread performance is still important: the Amdahl fraction

• Multithread performance exploits parallelism

• Simpler pipelines

• Reduce energy per instruction

• Core scaling
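The Amdahl fraction mentioned above limits how far core scaling alone can go. A minimal sketch of Amdahl's law follows; the serial fractions and core counts are illustrative values, not measurements from the talk.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Speedup over one core when only the parallel part scales with cores."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


if __name__ == "__main__":
    for serial_fraction in (0.01, 0.05, 0.10):  # illustrative values
        for n_cores in (16, 64, 256):
            speedup = amdahl_speedup(serial_fraction, n_cores)
            print(f"serial={serial_fraction:.2f} cores={n_cores:3d} "
                  f"speedup={speedup:6.1f}")
```

With a serial fraction s, the speedup can never exceed 1/s no matter how many simple cores are added, which is why single-thread performance still matters.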


Major Customization Trends

Uniform ISA Asymmetric

• Minimal disruption to the software ecosystems

• Limited customization?

Multi-ISA Heterogeneous

• Disruptive impact on the software stack?

• Higher degree of customization

Examples shown: PowerEN, Knights Corner

Heterogeneity on Chip

Intel Sandy Bridge

• Vector Extensions

• AES Instructions

• Programmable Pipeline (GEN6)

PowerEN: Programmable Accelerator

• 16 PowerPC cores

• Accelerators: Crypto Engine, RegEx Engine, XML Engine, Compress Engine

Intel Knights Corner

• Multiple Models of Computation

• Multi-ISA


Asymmetry vs. Heterogeneity

• Multiple voltage and frequency islands

• Different memory technologies: STT-RAM, PCM, Flash

[Figure: tiled multicore layouts, each a grid of tiles with memory controllers (MC)]

Uniform ISA

Performance Asymmetry

• Complex cores and simple cores

• Shared instruction set architecture (ISA)

Functional Asymmetry

• Subset ISA

• Distinct microarchitecture

• Fault and migrate model of operation¹

Multi-ISA

Heterogeneous

• Multi-ISA

• Microarchitecture

• Memory & interconnect hierarchy

¹ T. Li et al., “Operating system support for shared ISA asymmetric multi-core architectures,” in WIOSCA, 2008.
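The fault and migrate model listed under functional asymmetry can be pictured with a toy simulation. Everything here is hypothetical: the instruction names, the two ISA sets, and the policy of staying on the full-ISA core after a fault are assumptions made for illustration, not the mechanism described by Li et al.

```python
FULL_ISA = {"add", "sub", "mul", "fmul"}
SUBSET_ISA = {"add", "sub"}  # hypothetical: the simple core lacks mul and fmul


class UnsupportedInstruction(Exception):
    """Raised when a core is asked to run an instruction outside its ISA."""


class Core:
    def __init__(self, name, isa):
        self.name = name
        self.isa = isa

    def execute(self, instruction):
        if instruction not in self.isa:
            raise UnsupportedInstruction(instruction)
        return f"{instruction}@{self.name}"


def run_with_fault_and_migrate(program, small_core, big_core):
    """Start on the subset-ISA core; on a fault, migrate and retry.

    Simplification: after migrating, the thread stays on the big core.
    """
    current = small_core
    trace = []
    migrations = 0
    for instruction in program:
        try:
            trace.append(current.execute(instruction))
        except UnsupportedInstruction:
            current = big_core
            migrations += 1
            trace.append(current.execute(instruction))
    return trace, migrations


if __name__ == "__main__":
    small = Core("small", SUBSET_ISA)
    big = Core("big", FULL_ISA)
    print(run_with_fault_and_migrate(["add", "mul", "sub"], small, big))
```

In this toy version the migration cost is just a counter; a real system would also have to account for moving thread state and warming caches on the destination core.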

System Diversity

Keeneland System (GPUs)

Amazon EC2 GPU Instances

Hardware Diversity is Mainstream

Mobile Platforms (DSP, GPUs)


Cray Titan (GPUs)


Application Space

• Applications are diversifying at a remarkable rate!

• Computations tend to be irregular and unstructured, with behaviors that are hard to predict

Summary: New Performance Scaling Rules

• Energy efficiency: Scale performance by scaling energy efficiency

• Parallelism: Scale the number of cores rather than the performance of a single core

• Data movement: Moving data costs more energy than computing on it

• Physics capacity: Scaling is limited by thermal/power capacity


Outline

• Impending Power and Thermal Limits to Multicore

• New Rules: Scaling Performance

• Application: Co-Design of a Multicore Architecture and Thread Scheduler
  - Focus: asymmetric, coherent shared memory architecture

Energy Efficient Asymmetric Multicore Architecture

• How do we organize the architecture?
  - Cores and memory hierarchy

• Key observations
  - Performance: reduced by migration cost
  - Energy: reduced by core types (not core location!)

Energy efficient organization

Core diversity

Workload diversity

Data Movement optimization

Choudhary & Yalamanchili 2012


Application Characteristics

• Need to match time-varying application behaviors to the right core
  - Fine-grained scheduling

[Figure: out-of-order (OOO) vs. in-order (IO) cores for PARSEC and SPEC2006 benchmarks, with two regions marked "opportunity"]

Asymmetric Multicore Organization

• Locally Asymmetric Globally Symmetric (LAGS) organization

• Reduce migration costs for the same level of energy efficiency


Microscheduling

• OOO core time is scheduled based on measured performance, e.g., IPC

• Compute a new schedule every T steps

When and how do we share the high energy, high performance core?
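One way to picture the question is a proportional-share policy recomputed once per interval T. The sketch below is an assumption made for illustration and is not the utility-based formulation from the talk: it splits the shared OOO core's time in proportion to each thread's measured IPC gain over its in-order core, and all names and numbers are hypothetical.

```python
def ooo_time_shares(measured_speedup):
    """Split one interval of the shared OOO core among threads.

    measured_speedup maps a thread id to (IPC on the OOO core) divided by
    (IPC on its in-order core), a hypothetical per-interval measurement.
    Shares are proportional to the gain above 1.0; a thread with no gain
    receives no OOO time.
    """
    gains = {tid: max(s - 1.0, 0.0) for tid, s in measured_speedup.items()}
    total = sum(gains.values())
    if total == 0.0:
        return {tid: 0.0 for tid in gains}
    return {tid: gain / total for tid, gain in gains.items()}


def microschedule(samples_per_interval):
    """Recompute the shares once per interval T from that interval's samples."""
    return [ooo_time_shares(sample) for sample in samples_per_interval]


if __name__ == "__main__":
    samples = [
        {"t0": 2.0, "t1": 1.5, "t2": 1.0},  # interval 1 (illustrative)
        {"t0": 1.1, "t1": 1.9, "t2": 1.4},  # interval 2: behavior has shifted
    ]
    for i, shares in enumerate(microschedule(samples), start=1):
        print(f"interval {i}: {shares}")
```

Recomputing the shares every interval lets the allocation follow time-varying application behavior; a fuller model would also charge for context switches and migration.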

Formulation

• Function is convex

• Lends itself to simpler optimization techniques

• Comparison to a full non-linear optimization formulation


Some Experimental Results: Setup

• IO core mimics a 14-stage Atom core

• Account for energy of context switches and migration

Utility Scheduling

[Figure: perf/energy metric for three schedulers: ISCA2012 sampling-based, non-linear optimization, and utility-based]

• Baseline is round robin

• T = 5 ms


Effect of Core Composition


Integrated Scheduling and Cache Management

Coordinate scheduling and cache management

• Cores access the cache at different rates, creating interference patterns


Impact of Co-Design

• Key: The number of core types matters, not their location → locally asymmetric

• Exploit variation in the workload → microscheduling

• Local coherence domain → minimize migration costs

Conclusion

• New Scaling Rules
  - Data movement, core scaling, energy efficiency

• Must take a more holistic view

• Heterogeneity, asymmetry, and technology diversity are the new normal

© VLSI Design Conference 2013