SIMD Lane Decoupling: Improved Timing-Error Resilience
Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)
All systems power/energy bound
• The good:
  – Transistors still following Moore’s Law
• The bad:
  – Transistor power efficiency improving too slowly
  – Larger fraction of power to non-compute resources
• The conclusion:
  – Better algorithms
  – More efficient architectures
  – Proportionality: waste less of what you have
• This paper: SIMD + timing speculation
  – Efficient architecture + proportional guardbands
SIMD Lane Decoupling (C) M. Erez, E. Krimer
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
Voltage/timing margins “waste” energy
• Illustrative only – not to scale
[Figure: breakdown of one clock cycle today. Bar labels: typical logic delay, maximum logic delay, noise guard-band, wearout guard-band, process variation guard-band, temperature, …; x-axis: time (1 cycle)]
Voltage/timing margins “waste” energy
• Illustrative only – not to scale
[Figure: breakdown of one clock cycle, shown for "Today" and for "Future". Bar labels in both cases: typical logic delay, maximum logic delay, noise guard-band, wearout guard-band, process variation guard-band, temperature, …; x-axis: time (1 cycle)]
Timing speculation to the rescue [Ernst04]
• Razor latches
• Speculate low delay
• Detect violations
  – Early/late mismatch
• Recover by stalling
  – Requires fast “global” signal
  – Alternative – flush
• Requires extra ~10% logic
• Path delay restrictions: Δ < t < Δ + cycle
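The path-delay window above can be read as a classification of each path. The sketch below is illustrative only. It assumes a main flip-flop that samples at the clock edge and a shadow latch that samples Δ later, which is the usual description of Razor; the function name `classify_path` and the string labels are hypothetical and do not come from the slides.

```python
def classify_path(delay: float, cycle: float, delta: float) -> str:
    """Classify one combinational path under Razor-style timing speculation.

    Assumed model (not taken from the slides): the main flip-flop samples at
    time `cycle`, the shadow latch samples at time `cycle + delta`.
    """
    if delay <= delta:
        # Short path: data launched at the next edge could reach the shadow
        # latch before it samples, so the early/late comparison is invalid.
        return "too short"
    if delay <= cycle:
        # Both samples agree: no error.
        return "on time"
    if delay < cycle + delta:
        # Main flip-flop is wrong, shadow latch is right: mismatch detected.
        return "late, detected"
    # Even the shadow latch misses the value: speculation cannot recover.
    return "too late"
```

Only the "on time" and "late, detected" cases fall inside the window Δ < t < Δ + cycle.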
Outline
• Setup: SIMD architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
SIMD leads to inefficient timing speculation
[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1). Curves: SISD, 16-wide SIMD, 32-wide SIMD]
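A back-of-the-envelope sketch of why wider SIMD suffers more from stall-based recovery. It is not the model used for the plot: it assumes independent errors per lane, a single pipeline stage, and a fixed stall penalty whenever any lane errs. The function name and the `stall_cycles` parameter are hypothetical.

```python
def throughput_fraction(p_err: float, lanes: int, stall_cycles: int = 1) -> float:
    """Fraction of peak throughput for a lockstep pipeline of `lanes` lanes.

    Assumption: the whole pipeline stalls `stall_cycles` cycles whenever at
    least one lane sees a timing error on the current instruction.
    """
    # Probability that at least one of the lanes errs on an instruction.
    p_any = 1.0 - (1.0 - p_err) ** lanes
    # Average cycles per instruction is 1 plus the expected stall time.
    return 1.0 / (1.0 + stall_cycles * p_any)


if __name__ == "__main__":
    for lanes in (1, 16, 32):
        print(lanes, round(throughput_fraction(0.05, lanes), 3))
```

Under these assumptions the stall probability grows as 1 − (1 − p)^W with SIMD width W, so a per-lane error rate that a scalar pipeline tolerates easily becomes costly at 16 or 32 lanes.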
Decoupled Parallel SIMD Pipeline (DPSP)
• Shallow FIFO for control (or between stages)
Decoupled Parallel SIMD Pipeline (DPSP)
• Decoupling mitigates SIMD impact
[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1). Curves: SISD, 32-wide SIMD, 32-wide DPSP]
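The effect of decoupling can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the DPSP microarchitecture: each timing error costs the affected lane one extra cycle, lockstep SIMD pays that cycle in every lane, and decoupled lanes may slip by at most `fifo_depth` instructions relative to each other. All names and parameters are hypothetical.

```python
import random


def simulate(num_lanes, p_err, fifo_depth, num_instr, seed=0):
    """Return (lockstep_throughput, decoupled_throughput) in instr/cycle.

    Both pipelines see the same random error pattern.
    """
    rng = random.Random(seed)
    lock_time = 0                  # lockstep: completion time so far
    issue = 0                      # decoupled: issue time of current instr
    lane_done = [0] * num_lanes    # decoupled: per-lane completion time
    all_done = []                  # all_done[i]: instr i done in every lane

    for i in range(num_instr):
        errs = [rng.random() < p_err for _ in range(num_lanes)]

        # Lockstep: one stall cycle for everyone if any lane errs.
        lock_time += 1 + int(any(errs))

        # Decoupled: issue one instruction per cycle, but only once the
        # FIFO slot used by instruction i - fifo_depth is free in all lanes.
        if i > 0:
            issue += 1
        if i >= fifo_depth:
            issue = max(issue, all_done[i - fifo_depth])

        # Each lane starts when it is free and the instruction has issued,
        # and pays the extra cycle only for its own error.
        for lane in range(num_lanes):
            start = max(lane_done[lane], issue)
            lane_done[lane] = start + 1 + int(errs[lane])
        all_done.append(max(lane_done))

    return num_instr / lock_time, num_instr / max(lane_done)


if __name__ == "__main__":
    lock, dec = simulate(num_lanes=32, p_err=0.05, fifo_depth=4, num_instr=2000)
    print(f"lockstep {lock:.3f}  decoupled {dec:.3f}")
```

In this toy model the decoupled pipeline never finishes later than the lockstep one for the same error pattern, because a lane that errs delays the others only once the FIFO slack is exhausted.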
DPSP challenge 1: inter-lane communication
• Decoupling may delay producer (store)
• Micro barriers
  – Enforce SIMD semantics
• Not a problem in practice with GPUs
  – Execution model requires explicit sync across CTAs / work-groups
DPSP challenge 2: memory access locality
• Loads and stores no longer aligned
  – Memory “divergence”
• May increase pressure on on-chip memory access
• May impact off-chip access
  – Old NVIDIA hardware had memory coalescing issues
  – No problem with coalescing buffers and caches
• Micro-barriers if problematic
  – Can be done implicitly or explicitly in hardware
  – Sync before every load
  – Prediction
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
Evaluation flow
• Error measurements
• Error probability model
• Energy-efficiency model
• Design-space exploration
• Architecture simulation validation
Measuring error rate
• Inherently circuit and implementation dependent
• Used 3 exemplary circuits
  – SPICE-simulated adder [Ernst04]
  – FPGA-modeled multiplier [Ernst04]
  – Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]
Pawlowski ISSCC’12
Modeling the error rate function
• 2-parameter model: errVmax and Slope
[Figure: fitted error-rate curves for Adder [Ernst04] and Mul. [Ernst04], annotated with errVmax and Slope]
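One way a two-parameter model of this kind could look is sketched below. The functional form is an assumption, since the slides give only the parameter names: here `err_vmax` is read as the highest supply voltage at which errors appear, and `slope` as the exponential growth rate of the error probability as Vdd drops below it. The function name is hypothetical.

```python
import math


def error_rate(vdd: float, err_vmax: float, slope: float) -> float:
    """Per-operation timing-error probability at supply voltage `vdd`.

    Assumed form (not from the slides): zero at and above `err_vmax`,
    growing exponentially below it, clamped to 1.
    """
    if vdd >= err_vmax:
        return 0.0
    # expm1(x) = exp(x) - 1, so the curve starts at 0 when vdd == err_vmax.
    return min(1.0, math.expm1(slope * (err_vmax - vdd)))
```

Different circuits would then differ only in their fitted `err_vmax` and `slope` values, which is what makes a sweep over those two parameters a compact design-space description.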
ET2 energy-efficiency metric
• Energy × (execution time)²
  – In circuit context: time = delay → ED²
• Isolates architecture efficiency
  – Independent of DVFS: with $E \propto V_{dd}^2$ and $t \propto 1/V_{dd}$, the product $E t^2$ does not change with $V_{dd}$
  – Shows improvements in addition to DVFS
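The voltage independence of ET2 follows directly from the two first-order proportionalities. A minimal numeric check, using normalized units and hypothetical function names:

```python
def energy(vdd: float) -> float:
    """Relative energy per operation, first-order: E ∝ Vdd^2."""
    return vdd ** 2


def delay(vdd: float) -> float:
    """Relative delay, first-order: t ∝ 1/Vdd."""
    return 1.0 / vdd


def et2(vdd: float) -> float:
    """Energy × time², the metric used on this slide."""
    return energy(vdd) * delay(vdd) ** 2


if __name__ == "__main__":
    for vdd in (1.0, 0.9, 0.8, 0.7):
        # Energy alone drops with Vdd, ET2 stays constant.
        print(vdd, round(energy(vdd), 3), round(et2(vdd), 3))
```

Energy by itself rewards simply lowering the voltage, whereas ET2 stays flat under that scaling, so any ET2 change reflects the architecture rather than the chosen operating point.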
Simple ET2 model
• Throughput (1/T):
• Relative energy: sum of a dynamic term and a static term
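The slide's equations did not survive extraction, so the following is a stand-in sketch rather than the authors' model. It assumes a fixed clock frequency, a one-cycle stall whenever any lane errs, dynamic energy proportional to Vdd², and static energy proportional to Vdd times execution time. The function name and the `static_frac` parameter (static share of energy at nominal voltage) are hypothetical.

```python
def relative_et2(vdd: float, p_err: float, lanes: int,
                 static_frac: float = 0.2) -> float:
    """ET2 relative to nominal voltage (vdd = 1.0) with no errors.

    All modeling choices here are assumptions made for illustration.
    """
    # Throughput term: relative execution time grows with the stall rate.
    stall_prob = 1.0 - (1.0 - p_err) ** lanes
    rel_time = 1.0 + stall_prob

    # Energy term: dynamic part scales with Vdd^2, static part with
    # Vdd and with how long the computation takes.
    dynamic = (1.0 - static_frac) * vdd ** 2
    static = static_frac * vdd * rel_time

    return (dynamic + static) * rel_time ** 2
```

With these assumptions, lowering Vdd helps as long as the error rate stays near zero, and the same error rate is penalized far more heavily in a wide lockstep pipeline than in a scalar one.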
GP-GPU simulation adds some realism
• Baseline uses ideal margins without speculation
  – Only max delay vs. typical delay left on table
  – Timing speculation overhead is 0 – 15% ET2
• GPGPU-Sim (version 2.1)
  – Cycle-based extendable GP-GPU simulator from UBC
• Developer-recommended parameters
• Extended to DPSP
  – Recovery through stall
  – Micro-barrier options
    • Explicit CTA/workgroup synchronization only (no mbarriers)
    • Implicit sync before every memory operation
• Power model based on Hong & Kim, ISCA’10
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
  – Design-space exploration
  – Architecture effects
ET2 vs. SIMD (no spec.)
• DPSP
[Figure: relative ET2 plotted over the errVmax and Slope parameters; Adder [Ernst04] and Mul. [Ernst04] marked]
* Relative ET2: lower elevation is better
DPSP vs. SIMD (w/ spec.)
• SIMD – DPSP
[Figure: ET2 difference plotted over the errVmax and Slope parameters; Adder [Ernst04] and Mul. [Ernst04] marked]
* ET2 difference: higher elevation is better
Bringing in architecture effects
[Figure panels: Adder; Fabricated MUL]
Summary
• Design margins are inefficient
• Naive timing speculation with SIMD is inefficient
• DPSP enables efficient speculation in SIMD
  – Microbarriers maintain semantics when necessary
  – With GPU, frequent mbarriers help memory access
• Simple models can capture error response
  – Error rate exponential with Vdd
  – Dependent on circuit and implementation
• Design-space exploration shows potential
  – When and why timing speculation should (not) be used
  – DPSP consistently improves ET2 (10 – 45%)
  – DPSP achieves 10 – 20% better ET2 than SIMD w/ spec.
BACKUP
Detailed ET2 vs. Vdd behavior
[Figure panels: NN, AES, BFS, MUM]
Frequent micro-barriers improve ET2
[Figure panels: Adder; Multiplier; Fab.]
Modeling the error rate function
[Figure: fitted error-rate curves for Adder [Ernst04] and Mul. [Ernst04], annotated with errVmax and Slope]
Proportional margining
• Static margin control
  – Binning
  – Vdd/frequency/biasing adjustment
• Dynamic margin control
  – Vdd/frequency/biasing for slowly varying effects
    • Temperature and aging
  – Clocking tricks
    • From GALS to dynamic and elastic clocking
[Figure: cycle-time breakdown. Bar labels: typical logic delay, maximum logic delay, noise guard-band, wearout guard-band, process variation guard-band, clock skew and jitter, other; x-axis: time]
Detailed results summary
• BFS
  – High divergence rate
  – Requires implicit synchronizations
  – Limits DPSP opportunities
• CP, DG, RAY
  – Sensitive to memory coalescing
  – Synchronization between memory operations solves it
• MUM
  – Low SIMD occupancy limits the benefit of decoupling
• WP
  – Not enough registers, lots of memory spills
  – Extremely sensitive to memory latency and the exact scheduling – disturbed by DPSP