Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Asymmetry-Aware Work-Stealing Runtimes
Christopher Torng, Moyang Wang, andChristopher Batten
School of Electrical and Computer EngineeringCornell University
Cornell IAP WorkshopOctober 2015
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Asymmetry-Aware Work-Stealing Runtimes
Work-StealingRuntimes
Single-ISAHeterogeneousArchitectures
Dynamic Voltageand Frequency
Scaling
DynamicAsymmetry
StaticAsymmetry
Cornell University Christopher Batten 2 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task A
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task B
Task A
Spawn Task B
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task B
Dequeue Task B
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task C
Task B
Spawn Task C
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task B
Spawn Task D
Task D
Task C
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task CTask D
Steal Task D Steal Task C
Task B
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task CTask D
Task E Task F
Spawn Task FSpawn Task E
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-Stealing Runtimes
Work inProgress
TaskQueues
Core 0 Core 1 Core 2 Core 3
Task CTask DTask E Task F
Steal Task E Steal Task F
I Work stealing has good performance, space requirements, andcommunication overheads in both theory and practice
I Supported in many popular concurrency plaforms including:Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task ParallelLibrary, Java’s Fork/Join Framework, and OpenMP
Cornell University Christopher Batten 3 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Static Asymmetry: Heterogeneous Multicore Systems
Performance (Millions of Instructions per Second)
En
erg
y (P
ico-J
oule
pe
r In
stru
ctio
n)
200 400 600 800 1000 1200 1400 1600 1800 2000
50
100
150
200
250
500
Incr
easing P
ower
in-order, 1-issue
Based on analytical models of90nm technology with joint optimization ofmicroarchitectural and circuit parameters
in-order, 2-issuein-order, 3-issueout-of-order, 1-issue
out-of-order, 3-issueout-of-order, 2-issue
IO/3
OOO/2
OOO/3
IO/1
Adpated from O. Azizi et al. “Energy-Performance Tradeoffs inProcessor Architecture and Circuit Design: A Marginal Cost Approach,” ISCA, 2010.
BigARM Cores
LittleARM Cores
A7 A7 A15 A15
L2$ L2$
Samsung Exynos OctaMobile Processor
Cornell University Christopher Batten 4 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Dynamic Asymmetry: Voltage/Frequency ScalingC
ore
Ener
gy C
onsu
mpt
ion
Performance
Fmin @ Vmin
Fmax @ Vmax
Fnom @ Vnom
VLLC
GraphicsEngine
Rin
g In
terc
onne
ct
SystemAgent
+PCIe
+DMI
VDRAM Cntl
VCCIN
FIV
R
CPUs+
Cache
VCPU0
VRING
VGPU
VSA
VIOA
VCPU1……
Intel Haswell integrates thevoltage control loop circuitry on-die
with inductors in-package.
Cornell University Christopher Batten 5 / 18
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-StealingRuntimes
DynamicAsymmetry
StaticAsymmetry
Bender et al."Online Scheduling
of Parallel Programs onHeterogeneous Sys ..."
SPAA 2002
Ribic et al."Energy-Efficient
Work-StealingLanguage Runtimes"
ASPLOS 2014
Azizi et al."Energy-performance Tradeoffs in Processor Architecture and Circuit Design:
A Marginal Cost Analysis" ISCA 2010
How can we use asymmetry awareness to improve theperformance and energy efficiency of a work-stealing runtime?
Cornell University Christopher Batten 6 / 18
Motivation • First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes Evaluation
Work-StealingRuntimes
DynamicAsymmetry
StaticAsymmetry
Talk Outline
Motivation
First-Order Modeling
Asymmetry-AwareWork-Stealing Runtimes
Evaluation
Cornell University Christopher Batten 7 / 18
Motivation • First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes Evaluation
0.6 0.8 1.0 1.2 1.4 1.6
Normalized Performance
Nor
mal
ized
Ene
rgy
Effi
cien
cy
0.8
0.9
1.0
1.1
1.2
VL = 1.0VVB = 1.0V
1.3
isop
ower
Higher PerformanceLower Energy Efficiency
Lower PerformanceHigher Energy Efficiency
VL = 1.3VVB = 1.0V
VL = 1.3VVB = 0.9V
Higher PerformanceHigher Energy Efficiency
Systematic wayto adjust voltageand frequency
to move into upperright quadrant?
Use average powerat Vnom as anoptimizationconstraint
Cornell University Christopher Batten 8 / 18
Motivation • First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes Evaluation
L3
L2
L1
L0
B3
B2
B1
B0
3
21
Opportunity #1
Balance performance/poweracross cores when all cores are busy
System with four active big cores and four active little cores
BLNo
rmal
ized
Powe
r
Normalized IPS0.0 0.5 1.0 1.5 2.0 2.5 3.001234567
VLVB
B LMar
gina
l IPS
/W
0.70 1.00 1.30 1.60 1.901.04 1.00 0.92 0.76 0.24
AggregateThroughput
Aggr
egat
e Sy
stem
IPS
0.00.20.40.60.81.01.2
0.70 1.00 1.30 1.60 1.901.04 1.00 0.92 0.76 0.24
VLVB
Technique #1: Work-Pacing
Cornell University Christopher Batten 9 / 18
Motivation • First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes Evaluation
L3
L2
L1
L0
B3
B2
B1
B0
3
21
Opportunity #2
Rest waiting cores thenbalance performance and power
acrossy busy cores
4B4L system with two active big cores and two active little cores
BL
Norm
aliz
ed P
ower
Normalized IPS0.0 0.5 1.0 1.5 2.0 2.5 3.001234567
2.200.39
VLVB
Mar
gina
l IPS
/W
B L
0.20 0.60 1.00 1.40 1.801.26 1.24 1.211.14 0.97
Aggr
egat
e Sy
stem
IPS
VLVB0.20 0.60 1.00 1.40 1.801.26 1.24 1.211.14 0.97
2.200.39
0.00.20.40.60.81.01.21.41.6
Technique #2: Work-Sprinting
Cornell University Christopher Batten 10 / 18
Motivation • First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes Evaluation
L3
L2
L1
L0
B3
B2
B1
B0
3
21
Opportunity #3
Move work from slow little coresto fast big cores
4B4L system with either one active big core OR one active little core
B
L
Nor
mal
ized
Pow
er
Normalized IPS0.0 0.5 1.0 1.5 2.0 2.5 3.001234567
Optimal VL = 2.59V
On Little CoreOptimal VB = 1.51V
On Big Core
Feasible VL = 1.3V Feasible VB = 1.3V
Speedup of 1.6x Speedup of 3.3x
Technique #3: Work-Mugging
Cornell University Christopher Batten 11 / 18
Motivation First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes • Evaluation
Work-StealingRuntimes
DynamicAsymmetry
StaticAsymmetry
Talk Outline
Motivation
First-Order Modeling
Asymmetry-AwareWork-Stealing Runtimes
Evaluation
Cornell University Christopher Batten 12 / 18
Motivation First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes • Evaluation
AAWS Runtime Techniques
On-Chip Interconnect
L1$ L1$
B0 L0
VoltageRegulators
DRAM Memory Controller
L1$ L1$
B1 L1
Shared L2$ Cache Banks
Work-PacingWhen all cores are busy, tune V&Fof little and big cores (based on offlinemarginal utility analysis) to increaseperformance and energy efficiency
DVFSController
HintsDecision
Vdd
Cornell University Christopher Batten 13 / 18
Motivation First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes • Evaluation
AAWS Runtime Techniques
On-Chip Interconnect
L1$ L1$
B0 L0
VoltageRegulators
DRAM Memory Controller
L1$ L1$
B1 L1
Shared L2$ Cache Banks
Work-PacingWhen all cores are busy, tune V&Fof little and big cores (based on offlinemarginal utility analysis) to increaseperformance and energy efficiency
DVFSController
HintsDecision
Vdd
Work-SprintingRest waiting cores to generate powerslack, tune V&F of busy cores (basedon offline marginal utility analysis)
Cornell University Christopher Batten 13 / 18
Motivation First-Order Modeling • Asymmetry-Aware Work-Stealing Runtimes • Evaluation
AAWS Runtime Techniques
On-Chip Interconnect
L1$ L1$
B0 L0
VoltageRegulators
DRAM Memory Controller
L1$ L1$
B1 L1
Shared L2$ Cache Banks
Work-PacingWhen all cores are busy, tune V&Fof little and big cores (based on offlinemarginal utility analysis) to increaseperformance and energy efficiency
DVFSController
Work-SprintingRest waiting cores to generate powerslack, tune V&F of busy cores (basedon offline marginal utility analysis)
Work-MuggingMove work from little to big cores byallowing big cores to preemptively mugtasks from little cores
User-Level Interrupt Network
VoltageRegulators
DVFSController
MugInstruction
MugInterrupt
Thread Context Swap
Cornell University Christopher Batten 13 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Work-StealingRuntimes
DynamicAsymmetry
StaticAsymmetry
Talk Outline
Motivation
First-Order Modeling
Asymmetry-AwareWork-Stealing Runtimes
Evaluation
Cornell University Christopher Batten 14 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Evaluation Methodology: Modeling
Work-Stealing RuntimeI State-of-the-art Intel TBB-inspired work-stealing schedulerI Chase-Lev task queues with occupancy-based victim selectionI Automatic recursive decomposition of parallel loop task chunkingI Instrumented with activity hints
Cycle-Level ModelingI Heterogeneous system modeled in gem5 cycle-approximate simulatorI Support for scaling per-core frequencies + central DVFS ControllerI Accounting for intercore interrupt latency during mugging
Energy ModelingI Event-based energy modeling based on McPAT models and detailed
RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm 1.0 V)I Per-inst energy benchmarks to isolate event energy (e.g., rfile reads)
Cornell University Christopher Batten 15 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Performance of Convex Hull ApplicationNo
rmal
ized
Exe
cutio
n Ti
me
1.00
High
Par
alle
l
Baseline
1.00 V 0.91 V0.70 V 1.06 V 1.33 VBusy Waiting
Big
Littl
e
Low
Par
alle
l
0.92
Work-Pacing
8% reduction
0.75
Work-Pacing, Sprinting
25%reduction
Big
Littl
e0.71
Work-Pacing, Sprinting, Mugging
29%reduction
Cornell University Christopher Batten 16 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Energy-Efficiency and Performance Results
0.0 0.2 0.4 0.6 0.80.0
0.2
0.4
0.6
0.8 baseline
Nor
mal
ized
Ene
rgy
Effic
ienc
y
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
isopo
wer
Performance1.0 1.21.1 1.3 1.4 1.5 1.6
baseline+work-pacing
baseline+work-pacing+work-sprinting
baseline+work-pacing+work-sprinting+work-mugging
Cornell University Christopher Batten 17 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Energy-Efficiency and Performance Results
0.0 0.2 0.4 0.6 0.80.0
0.2
0.4
0.6
0.8 baseline
Nor
mal
ized
Ene
rgy
Effic
ienc
y
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
isopo
wer
Performance1.0 1.21.1 1.3 1.4 1.5 1.6
baseline+work-pacing
baseline+work-pacing+work-sprinting
baseline+work-pacing+work-sprinting+work-mugging
Radix SortSpeedup 1.26xEnergy Efficiency 1.53x
Convex HullSpeedup 1.41xEnergy Efficiency 1.24x
Cornell University Christopher Batten 17 / 18
Motivation First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes • Evaluation •
Work-StealingRuntimes
DynamicAsymmetry
StaticAsymmetry
Take-Away Point
Holistically combining
• work-stealing runtimes• static asymmetry• dynamic asymmetry
through the use of
• work-pacing• work-sprinting• work-mugging
can improve both performance andenergy efficiency in future multicoresystems
Cornell University Christopher Batten 18 / 18