12/31/12
26th International Conference on VLSI, January 2013, Pune, India
Architectural Alternatives for Energy Efficient Performance Scaling
Sudhakar Yalamanchili
School of Electrical and Computer Engineering Georgia Institute of Technology
Outline
• Impending Power and Thermal Limits to Multicore
• New Rules: Scaling Performance
• Application: Co-Design of a Multicore Architecture and Thread Scheduler
Moore’s Law
From wikipedia.org
• Performance scaled with number of transistors
• Dennard scaling: power scaled with feature size
Goal: Sustain Performance Scaling
New Rules: The End of Dennard Scaling
[Figure: MOSFET cross-section labeled with gate, source, drain, channel length L, and oxide thickness tox]
• Change in scaling factors
• Slower scaling in power per transistor → increasing power densities
From R. Dennard, et al., “Design of ion-implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.
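A rough way to see why the end of Dennard scaling raises power density is to model dynamic power per transistor as C·V²·f and shrink linear dimensions by a factor k. The sketch below is illustrative only: the function name `scale_node` and the fixed-voltage case are assumptions made here, not figures taken from the slides or from Dennard et al.

```python
def scale_node(k, voltage_scales=True):
    """Return (power per transistor, power density) relative to the previous node.

    k > 1 is the linear shrink factor for one generation.
    Dynamic power is modeled as C * V^2 * f.
    """
    capacitance = 1.0 / k                          # C shrinks with dimensions
    frequency = k                                  # shorter gate delay, so f rises by k
    voltage = 1.0 / k if voltage_scales else 1.0   # assumption: V stays fixed post-Dennard
    power = capacitance * voltage ** 2 * frequency
    area = 1.0 / k ** 2                            # transistor area shrinks by k^2
    return power, power / area


# Classic Dennard scaling: voltage shrinks along with dimensions.
dennard_power, dennard_density = scale_node(2.0, voltage_scales=True)

# Simplified post-Dennard case: voltage no longer scales.
post_power, post_density = scale_node(2.0, voltage_scales=False)
```

Under these assumptions the power density stays constant in the first case and grows by k² in the second, which is the trend the slide describes.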
Impending Power and Thermal Limits for Multicore
Mukhopadhyay and Yalamanchili (2009)
• Based on scaling using Pentium-class cores modeled with IntSim¹
• Chip-level performance will be power and thermally limited!
¹ D. Sekar, A. Naeemi, R. Sarvari, J. Davis, and J. Meindl, “IntSim: A CAD tool for optimization of multilevel interconnect networks,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2007
Impending Power and Thermal Limits for Multicore
[Figure: performance vs. year. The curve predicted by Moore’s Law diverges from the curve limited by power/thermal constraints; the difference is labeled "Dark Silicon Gap".]
Managing the Physics
[Figure: relative CPU and GPU power and peak die temperature vs. time (seconds) for CU0, CU1, and GPU. Series: GPU power, CPU CU0 power, CPU CU1 power, peak die temperature. Annotations: CPU power is limited while the GPU runs at its maximum DVFS state; thermal coupling; temperature throttling.]
• Thermal coupling between CPU and GPU accelerates temperature rise
• Induces premature throttling
AMD Trinity APU
Paul, Manne, Bircher, & Yalamanchili 2012
Outline
• Impending Power and Thermal Limits to Multicore
• New Rules: Scaling Performance
• Application: Co-Design of a Multicore Architecture and Thread Scheduler
Post Dennard Architecture Performance Scaling
$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
W. J. Dally, Keynote IITC 2012
Operator_cost + Data_movement_cost
Three operands x 64 bits/operand
$$\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \text{energy}_{\text{bit-mm}}$$
Borkar & Chien, 2011
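The data movement term above can be evaluated directly for the three 64-bit operands of one operation. The sketch below is illustrative: the 10 mm distance and the 0.1 pJ per bit-mm wire energy are placeholder assumptions chosen for the arithmetic, not values from the slides or from Borkar & Chien, and the function name is hypothetical.

```python
def data_movement_energy_pj(num_bits, distance_mm, pj_per_bit_mm):
    """Energy = #bits x distance (mm) x energy per bit per mm, in picojoules."""
    return num_bits * distance_mm * pj_per_bit_mm


OPERANDS = 3              # from the slide: three operands per operation
BITS_PER_OPERAND = 64     # from the slide: 64 bits per operand
bits_moved = OPERANDS * BITS_PER_OPERAND

# Assumed, hypothetical parameters for illustration only.
ASSUMED_DISTANCE_MM = 10.0
ASSUMED_PJ_PER_BIT_MM = 0.1

movement_energy_pj = data_movement_energy_pj(
    bits_moved, ASSUMED_DISTANCE_MM, ASSUMED_PJ_PER_BIT_MM
)
```

Because the energy is linear in distance, halving how far operands travel halves this term, which is why locality matters for energy as well as for latency.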
Scaling Performance: Cost of Data Movement
Embedded Platforms. Goal: 1-100 GOps/W
Big Science: To Exascale. Goal: 20 MW/Exaflop
• Sustain performance scaling through massive concurrency
• Data movement becomes more expensive than computation
Courtesy: Sandia National Labs: R. Murphy.
Cost of Data Movement
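The exascale goal can be turned into an efficiency target using the relation performance = power × efficiency. The sketch below does that arithmetic, assuming "exaflop" means 10^18 operations per second; the variable names are illustrative.

```python
EXA_OPS_PER_SECOND = 1e18   # assumption: one exaflop = 1e18 operations per second
POWER_BUDGET_WATTS = 20e6   # 20 MW goal from the slide

# Efficiency (ops/joule) = Perf (ops/s) / Power (W)
efficiency_ops_per_joule = EXA_OPS_PER_SECOND / POWER_BUDGET_WATTS

# The same number in the units used for the embedded goal (GOps/W).
efficiency_gops_per_watt = efficiency_ops_per_joule / 1e9

# Energy available per operation, in picojoules.
energy_per_op_pj = 1e12 / efficiency_ops_per_joule
```

Under that assumption the target works out to 50 GOps/W, or about 20 pJ per operation, which falls inside the 1-100 GOps/W range quoted for embedded platforms.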
Post Dennard Architecture Performance Scaling
$$\text{Perf}\left(\frac{\text{ops}}{\text{s}}\right) = \text{Power}\,(\text{W}) \times \text{Efficiency}\left(\frac{\text{ops}}{\text{joule}}\right)$$
W. J. Dally, Keynote IITC 2012
Operator_cost + Data_movement_cost
Three operands x 64 bits/operand
Specialization → heterogeneity and asymmetry
$$\text{Energy} = \#\text{bits} \times \text{dist}_{\text{mm}} \times \text{energy}_{\text{bit-mm}}$$
Borkar & Chien, 2011
Hardware Power-Performance Tradeoffs
[Figure: energy efficiency (GOps/Watt) vs. programmability/flexibility. Points: ASIC; FPGA (Xilinx Virtex 6); DSP (LP) (TMS320671D); GPU (NVIDIA Tesla); in-order processor (Atom); OOO processor (Westmere-EP). Images: Model T (freewebs.com), freecaroffers.net.]
Customization is key to energy efficiency!
Computer Architecture Today: Speculation and Locality
Locality of reference (spatial/temporal)
Predictable control flow
Latency Hiding
Frequency Scaling
    add  r1, r2, r3
    sub  r4, r1, r5
    slt  r6, r8, r9
    bne  r6, exit
    muld f0, f2, f4
    ..
    ..
Speculative execution
Speculative fetch
Speculative pre-fetch
Correctness check on speculation
You Cannot Hide Energy!
Scaling Performance: Simplify and Multiply
AMD Bulldozer Core
ARM A7 Core (arm.com)
• Extracting single thread performance is still important – Amdahl fraction
• Multithread performance exploits parallelism
• Simpler pipelines
• Reduce energy per instruction
• Core scaling
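The reference to the Amdahl fraction can be made concrete with Amdahl's law. The sketch below is a minimal illustration; the `serial_core_speedup` parameter is an assumption added here to show the effect of running the serial part on a faster core, and is not something the slides define.

```python
def amdahl_speedup(serial_fraction, n_cores, serial_core_speedup=1.0):
    """Speedup over one baseline core under Amdahl's law.

    serial_fraction: share of the work that cannot be parallelized.
    n_cores: number of baseline cores running the parallel part.
    serial_core_speedup: assumed speedup of the core that runs the serial part.
    """
    parallel_fraction = 1.0 - serial_fraction
    time = serial_fraction / serial_core_speedup + parallel_fraction / n_cores
    return 1.0 / time


# 10% serial work on 16 simple cores.
simple_only = amdahl_speedup(0.10, 16)

# Same workload, with the serial part on a core assumed to be 2x faster.
with_fast_core = amdahl_speedup(0.10, 16, serial_core_speedup=2.0)
```

With 10% serial work, 16 simple cores give a speedup of about 6.4x, and no number of simple cores can exceed 10x. That bound is why single thread performance still matters when scaling by adding cores.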
Major Customization Trends
• Disruptive impact on the software stack?
• Higher degree of customization
PowerEN
Uniform ISA Asymmetric
• Minimal disruption to the software ecosystems
• Limited customization?
Multi-ISA Heterogeneous
Knights Corner
Heterogeneity on Chip
Vector Extensions, AES Instructions
Programmable Pipeline (GEN6)
Intel Sandy Bridge
Programmable Accelerator
PowerEN
16 PowerPC cores
Accelerators • Crypto Engine • RegEx Engine • XML Engine • Compress Engine
Intel Knights Corner
Multiple Models of Computation Multi-ISA
Asymmetry vs. Heterogeneity
• Multiple voltage and frequency islands
• Different memory technologies
– STT-RAM, PCM, Flash
[Figure: two tiled multicore layouts, each a grid of tiles (Tile) with memory controllers (MC)]
Performance Asymmetry
Functional Asymmetry
Heterogeneous
– Complex cores and simple cores
– Shared instruction set architecture (ISA)
• Subset ISA
• Distinct microarchitecture
• Fault and migrate model of operation¹
Uniform ISA Multi-ISA
¹ Li, T., et al., “Operating system support for shared ISA asymmetric multi-core architectures,” in WIOSCA, 2008.
• Multi-ISA
• Microarchitecture
• Memory & Interconnect hierarchy
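The fault and migrate model mentioned for subset-ISA cores can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the mechanism described by Li et al.: the opcode names, the `Core` type, and the policy of staying on the complex core after a migration are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical instruction set; the simple core implements only a subset.
FULL_ISA = frozenset({"add", "sub", "load", "store", "fmul"})


@dataclass(frozen=True)
class Core:
    name: str
    supported_ops: frozenset


SIMPLE = Core("simple", FULL_ISA - {"fmul"})   # subset ISA
COMPLEX = Core("complex", FULL_ISA)            # full ISA


def run(trace, start=SIMPLE, fallback=COMPLEX):
    """Run a trace of opcodes, migrating to `fallback` on an unsupported op.

    Returns the (op, core name) placement and the number of migrations.
    Assumes the fallback core supports every op in the trace.
    """
    core = start
    migrations = 0
    placement = []
    for op in trace:
        if op not in core.supported_ops:
            # "Fault": the current core cannot execute this op, so migrate.
            core = fallback
            migrations += 1
        placement.append((op, core.name))
    return placement, migrations


placement, migrations = run(["add", "fmul", "sub"])
```

A real system would also have to decide when to move the thread back to the simple core; that policy is left out here.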
System Diversity
Keeneland System (GPUs)
Amazon EC2 GPU Instances
Hardware Diversity is Mainstream
Mobile Platforms (DSP, GPUs)
Cray Titan (GPUs)
Application Space
• Applications are diversifying at a remarkable rate!
• Computations tend to be irregular, unstructured, and behaviors hard to predict
Summary: New Performance Scaling Rules
• Energy efficiency: Scale performance by scaling energy efficiency
• Parallelism: Scale the number of cores rather than the performance of a single core
• Data Movement: The energy cost of data movement exceeds the energy cost of computation
• Physics Capacity: Scaling is limited by thermal/power capacity
Outline
• Impending Power and Thermal Limits to Multicore
• New Rules: Scaling Performance
• Application: Co-Design of a Multicore Architecture and Thread Scheduler
– Focus: asymmetric, coherent shared memory architecture
Energy Efficient Asymmetric Multicore Architecture
• How do we organize the architecture? – Cores and memory hierarchy
• Key observations
– Performance: reduced by migration cost
– Energy: reduced by core types (not core location!)
Energy efficient organization
Core diversity
Workload diversity
Data Movement optimization
Choudhary & Yalamanchili 2012
Application Characteristics
• Need to match time-varying application behaviors to the right core – Fine grained scheduling
[Figure: out-of-order (OOO) vs. in-order (IO) cores for PARSEC and SPEC2006 benchmarks, with regions marked "opportunity"]
Asymmetric Multicore Organization
• Locally Asymmetric Globally Symmetric (LAGS) organization
• Reduce migration costs for same level of energy efficiency
Microscheduling
• OOO core is time-scheduled based on measured performance, e.g., IPC
• Compute a new schedule every T steps
When and how do we share the high energy, high performance core?
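One simple way to picture sharing the OOO core is a greedy rule that, every T steps, hands the core to the thread whose measured IPC gains the most from it. The sketch below is an assumption made for illustration, not the formulation used in this work: the function names, the availability of per-thread IPC samples on both core types, and the greedy rule itself are all hypothetical.

```python
def pick_ooo_thread(ipc_ooo, ipc_io):
    """Return the thread with the largest measured IPC gain on the OOO core."""
    return max(ipc_ooo, key=lambda thread: ipc_ooo[thread] - ipc_io[thread])


def microschedule(samples, period):
    """Recompute the OOO core owner every `period` steps.

    samples: one (ipc_ooo, ipc_io) pair of dicts per step, keyed by thread id.
    Returns the thread that owns the OOO core at each step.
    """
    owners = []
    owner = None
    for step, (ipc_ooo, ipc_io) in enumerate(samples):
        if step % period == 0:
            owner = pick_ooo_thread(ipc_ooo, ipc_io)
        owners.append(owner)
    return owners


# Hypothetical measurements for two threads whose behavior changes over time.
io_ipc = {"A": 1.0, "B": 1.0}
samples = [
    ({"A": 2.0, "B": 1.2}, io_ipc),   # A benefits most
    ({"A": 1.1, "B": 2.0}, io_ipc),   # B benefits most, but no reschedule yet
    ({"A": 1.1, "B": 2.0}, io_ipc),   # reschedule point: B takes the OOO core
    ({"A": 2.0, "B": 1.2}, io_ipc),   # A benefits again, but no reschedule yet
]
schedule = microschedule(samples, period=2)
```

The example shows the trade-off behind the choice of T: a longer period means fewer migrations but a slower reaction to changes in thread behavior.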
Formulation
• Function is convex
• Lends itself to simpler optimization techniques
• Comparison to a full non-linear optimization formulation
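To illustrate why convexity permits simpler optimization techniques, the sketch below minimizes a one-dimensional convex function with ternary search, which needs only function evaluations. The cost function is a hypothetical stand-in (the share of an epoch a thread spends on the OOO core, with an arbitrary optimum at 0.3); it is not the objective used in this work.

```python
def ternary_search_min(f, lo, hi, tol=1e-6):
    """Return an approximate minimizer of a convex function f on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2   # for convex f, a minimizer lies in [lo, m2]
        else:
            lo = m1   # for convex f, a minimizer lies in [m1, hi]
    return (lo + hi) / 2.0


def assumed_cost(ooo_share):
    """Hypothetical convex cost of giving a thread `ooo_share` of the epoch."""
    return (ooo_share - 0.3) ** 2 + 0.5


best_share = ternary_search_min(assumed_cost, 0.0, 1.0)
```

Each iteration discards a third of the interval, so the search converges without derivatives or a general non-linear solver.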
Some Experimental Results: Setup
• IO core mimics a 14-stage Atom core
• Account for energy of context switches and migration
Utility Scheduling
ISCA2012 Sampling-Based Non-linear Opt. Utility-Based Perf/Energy Metric
• Baseline is round robin
• T = 5 ms
Effect of Core Composition
Integrated Scheduling and Cache Management
Coordinate scheduling and cache management
• Cores access the cache at different rates, creating interference patterns
Impact of Co-Design
• Key: The number of core types matters, not the location → locally asymmetric
• Exploit variation in the workload → microscheduling
• Local coherence domain → minimize migration costs
Conclusion
• New Scaling Rules – Data movement, core scaling, energy efficiency
• Must take a more holistic view
• Heterogeneity, asymmetry, and technology diversity are the new normal
© VLSI Design Conference 2013