SIMD Lane Decoupling Improved Timing-Error Resilience

Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)

All systems power/energy bound
• The good:
– Transistor still following Moore’s Law
• The bad:
– Transistor power efficiency improving too slowly
– Larger fraction of power to non-compute resources
• The conclusion:
– Better algorithms
– More efficient architectures
– Proportionality: waste less of what you have
• This paper: SIMD + timing speculation
– Efficient architecture + proportional guardbands

SIMD Lane Decoupling (C) M. Erez, E. Krimer

Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
– Problem and DPSP solution
• Methodology and modeling
• Evaluation

Voltage/timing margins “waste” energy
• Illustrative only – not to scale
[Figure: breakdown of one clock cycle today: typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, temperature, …]

Voltage/timing margins “waste” energy
• Illustrative only – not to scale
[Figure: breakdown of one clock cycle, today vs. future, each showing typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, temperature, …]

Timing speculation to the rescue [Ernst04]
• Razor latches
• Speculate low delay
• Detect violations
– Early/late mismatch
• Recover by stalling
– Requires fast “global” signal
– Alternative – flush
• Requires extra ~10% logic
• Path delay restrictions: Δ < t < Δ+cycle
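The path delay restriction can be read as a classification of each path's delay t against two sampling points. The sketch below is only an illustration: it assumes Δ is the delay of the shadow-latch clock relative to the main clock (the usual Razor arrangement), and the function name and result labels are hypothetical.

```python
def classify_path(t, cycle, delta):
    """Classify a path delay t under Razor-style timing speculation.

    Assumption: the main flip-flop samples at `cycle` and the shadow
    latch samples at `cycle + delta`.
    """
    if t <= delta:
        # Too fast: next-cycle data could reach the shadow latch early.
        return "short-path violation"
    if t <= cycle:
        # Main flip-flop and shadow latch agree.
        return "on time"
    if t < cycle + delta:
        # Main flip-flop misses, shadow latch catches: early/late mismatch.
        return "late, detected"
    # Even the shadow latch samples stale data.
    return "late, undetected"
```

Only delays in the open interval (Δ, Δ+cycle) are either correct or recoverable, which matches the restriction stated on the slide.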


Outline
• Setup: SIMD architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD


SIMD leads to inefficient timing speculation


[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1), for SISD, 16-wide SIMD, and 32-wide SIMD]
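A back-of-the-envelope model shows why width hurts. The sketch below is not the authors' model: it assumes a single pipeline stage, independent errors per lane, and a fixed one-cycle global stall whenever any lane errs. The names `lockstep_throughput` and `stall_cycles` are illustrative.

```python
def lockstep_throughput(p_err, width, stall_cycles=1):
    """Fraction of peak throughput for a lockstep pipeline.

    Assumption: every lane errs independently with probability p_err,
    and an error in any lane stalls all lanes for stall_cycles.
    """
    p_any = 1.0 - (1.0 - p_err) ** width   # P(at least one lane errs)
    return 1.0 / (1.0 + stall_cycles * p_any)


for width in (1, 16, 32):                  # SISD, 16-wide, 32-wide
    print(width, round(lockstep_throughput(0.05, width), 3))
```

Because the chance that at least one lane errs approaches 1 as the width grows, a wide lockstep pipeline pays the stall on almost every instruction even when each lane's error rate is small.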

Decoupled Parallel SIMD Pipeline (DPSP)
• Shallow FIFO for control (or between stages)

Decoupled Parallel SIMD Pipeline (DPSP)
• Decoupling mitigates SIMD impact
[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1), for SISD, 32-wide SIMD, and 32-wide DPSP]
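The effect of decoupling can be explored with a toy cycle-level simulation. This is a sketch under stated assumptions, not the GPGPU-Sim extension used in the talk: one stage per lane, a one-cycle recovery per error, and a per-lane control FIFO whose depth bounds how far lanes may drift apart. All names are hypothetical.

```python
import random


def simulate_throughput(width, p_err, fifo_depth, cycles=20000, seed=0):
    """Toy model; returns completed instructions per cycle.

    fifo_depth=1 forces lockstep (SIMD with a global stall); larger
    values let lanes drift apart by up to fifo_depth instructions.
    """
    rng = random.Random(seed)
    issued = 0                     # instructions pushed by the control front end
    done = [0] * width             # instructions retired by each lane
    recovering = [False] * width   # lane is spending a cycle on error recovery
    for _ in range(cycles):
        # Issue only while the slowest lane still has FIFO space.
        if issued - min(done) < fifo_depth:
            issued += 1
        for lane in range(width):
            if recovering[lane]:
                recovering[lane] = False
                done[lane] += 1               # recovery cycle completes the op
            elif done[lane] < issued:
                if rng.random() < p_err:
                    recovering[lane] = True   # timing error stalls this lane only
                else:
                    done[lane] += 1
    # An instruction counts as complete once every lane has retired it.
    return min(done) / cycles


lockstep = simulate_throughput(32, 0.05, fifo_depth=1)
decoupled = simulate_throughput(32, 0.05, fifo_depth=8)
print(lockstep, decoupled)
```

With a depth of 1 every error stalls all lanes, while a deeper FIFO lets error-free lanes keep working, so the slowest lane rather than the union of all errors sets the pace.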

DPSP challenge 1: inter-lane communication
• Decoupling may delay producer (store)
• Micro barriers
– Enforce SIMD semantics
• Not a problem in practice with GPUs
– Execution model requires explicit sync across CTAs / work-groups
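The hazard and the fix can be illustrated in software, using threads as a loose stand-in for decoupled lanes. This is an analogy only, not the hardware mechanism: `run_lanes` is a hypothetical helper, and Python's `threading.Barrier` plays the role of a micro-barrier between a store and a dependent load from a neighboring lane.

```python
import threading


def run_lanes(width, use_barrier=True):
    """Each 'lane' stores a value, then loads its neighbor's value."""
    shared = [None] * width
    out = [None] * width
    barrier = threading.Barrier(width)

    def lane(i):
        shared[i] = i * 10                    # producer store
        if use_barrier:
            barrier.wait()                    # micro-barrier: all stores done
        out[i] = shared[(i + 1) % width]      # consumer load from neighbor

    threads = [threading.Thread(target=lane, args=(i,)) for i in range(width)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(run_lanes(8))
```

With the barrier, every load observes its neighbor's store regardless of how far the lanes have drifted. Without it, a fast consumer may read a location before a delayed producer has written it.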


DPSP challenge 2: memory access locality
• Loads and stores no longer aligned
– Memory “divergence”
• May increase pressure on on-chip memory access
• May impact off-chip access
– Old NVIDIA hardware had memory coalescing issues
– No problem with coalescing buffers and caches
• Micro-barriers if problematic
– Can be done implicitly or explicitly in hardware
– Sync before every load
– Prediction
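A rough way to see the coalescing concern is to count memory transactions when lanes reach the same load in different cycles. The sketch below is illustrative only: the 128-byte line size, the fixed-window merging rule, and the helper name `count_transactions` are all assumptions, not details from the talk.

```python
from collections import defaultdict

LINE_BYTES = 128  # assumed memory transaction size


def count_transactions(addresses, arrival_cycles, window=1):
    """Count line-sized transactions when requests that arrive within the
    same `window`-cycle bucket are merged (window=1: no merging across cycles)."""
    lines_per_bucket = defaultdict(set)
    for addr, cycle in zip(addresses, arrival_cycles):
        lines_per_bucket[cycle // window].add(addr // LINE_BYTES)
    return sum(len(lines) for lines in lines_per_bucket.values())


addresses = [4 * lane for lane in range(32)]   # 32 lanes, consecutive words
aligned = [0] * 32                             # lockstep: all lanes arrive together
drifted = [lane % 4 for lane in range(32)]     # decoupled: spread over 4 cycles

print(count_transactions(addresses, aligned))            # one shared line
print(count_transactions(addresses, drifted))            # one per arrival cycle
print(count_transactions(addresses, drifted, window=4))  # buffer re-merges them
```

In this toy setting a coalescing buffer that spans the drift (or a sync before the load) restores the single transaction of the lockstep case.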

Evaluation flow

Error Measurements

Error Probability Model

Energy-Efficiency Model

Design Space Exploration

Arch Sim. Validation


Measuring error rate
• Inherently circuit and implementation dependent
• Used 3 exemplary circuits
– SPICE-simulated adder [Ernst04]
– FPGA-modeled multiplier [Ernst04]
– Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]

Pawlowski ISSCC’12

Modeling the error rate function
• 2-parameter model
[Figure: error-rate model annotated with its two parameters, errVmax and Slope; curves for Adder [Ernst04] and Mul. [Ernst04]]
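The slides name the two parameters (errVmax and Slope) and state elsewhere that the error rate is exponential in Vdd, but they do not give the functional form. The sketch below is therefore only one plausible shape: it assumes errors vanish at and above errVmax and grow exponentially as Vdd drops below it, and it introduces a hypothetical `p_floor` constant that is not part of the original model.

```python
import math


def error_probability(vdd, err_vmax, slope, p_floor=1e-9):
    """Assumed error-rate shape: zero at or above err_vmax, exponential below.

    p_floor (the error probability just below err_vmax) is an illustrative
    constant, not one of the two parameters named on the slide.
    """
    if vdd >= err_vmax:
        return 0.0
    return min(1.0, p_floor * math.exp(slope * (err_vmax - vdd)))


for vdd in (1.00, 0.90, 0.85, 0.80):
    print(vdd, error_probability(vdd, err_vmax=0.9, slope=100.0))
```

Fitting err_vmax and slope to measured points for each circuit would then give a per-circuit error response, consistent with the observation that the curve is circuit and implementation dependent.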

ET2 energy-efficiency metric
• Energy × (execution) Time²
– In circuit context: time = delay → ED²
• Isolates architecture efficiency
– Independent of DVFS
– Shows improvements in addition to DVFS

$E \propto V_{dd}^{2}$, $t \propto 1/V_{dd}$
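The DVFS independence follows from first-order voltage scaling. As a short worked step, assuming the idealized relations that energy scales with the square of Vdd and delay scales inversely with Vdd:

```latex
E \propto V_{dd}^{2}, \qquad t \propto \frac{1}{V_{dd}}
\quad\Longrightarrow\quad
E\,t^{2} \;\propto\; V_{dd}^{2}\cdot\frac{1}{V_{dd}^{2}} \;=\; \text{const.}
```

Under these assumptions, scaling the voltage and frequency moves a design along a line of constant ET2, so any change in ET2 reflects the architecture rather than the operating point.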

Simple ET2 model
• Throughput (1/T):
• Relative energy: dynamic and static terms

GP-GPU simulation adds some realism
• Baseline uses ideal margins without speculation
– Only max delay vs. typical delay left on table
– Timing speculation overhead is 0 – 15% ET2
• GPGPUSim (version 2.1)
– Cycle-based extendable GP-GPU simulator from UBC
• Developer-recommended parameters
• Extended to DPSP
– Recovery through stall
– Micro-barrier options
  • Explicit CTA/workgroup synchronization only (no mbarriers)
  • Implicit sync before every memory operation
• Power model based on Hong & Kim, ISCA’10

Outline
• Setup:
• Evaluation
– Design-space exploration
– Architecture effects

ET2 vs. SIMD (no spec.)
• DPSP
[Figure: relative ET2 of DPSP over the errVmax and Slope design space; lower elevation is better]

DPSP vs. SIMD (w/ spec.)
• SIMD – DPSP
[Figure: ET2 difference over the errVmax and Slope design space; higher elevation is better]

Bringing in architecture effects
[Figure: panels for Adder and Fabricated MUL]

Summary
• Design margins inefficiency
• Naive timing speculation with SIMD is inefficient
• DPSP enables efficient speculation in SIMD
– Microbarriers maintain semantics when necessary
– With GPU, frequent mbarriers help memory access
• Simple models can capture error response
– Error rate exponential with Vdd
– Dependent on circuit and implementation
• Design-space exploration shows potential
– When and why timing speculation should (not) be used
– DPSP consistently improves ET2 (10 – 45%)
– DPSP achieves 10 – 20% better ET2 than SIMD w/ spec.

BACKUP


Detailed ET2 vs. Vdd behavior
[Figure: panels for NN, AES, BFS, and MUM]

Frequent micro-barriers improve ET2
[Figure: panels for Adder, Multiplier, and Fab.]

Modeling the error rate function
[Figure: error-rate model annotated with its two parameters, errVmax and Slope]


Proportional margining
• Static margin control
– Binning
– Vdd/frequency/biasing adjustment
• Dynamic margin control
– Vdd/frequency/biasing for slowly varying effects
  • Temperature and aging
– Clocking tricks
  • From GALS to dynamic and elastic clocking
[Figure: breakdown of cycle time: typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, clock skew and jitter, and other]

Detailed results summary
• BFS
– High divergence rate
– Requires implicit synchronizations
– Limits DPSP opportunities
• CP, DG, RAY
– Sensitive to memory coalescing
– Synchronization between memory operations solves it
• MUM
– Low SIMD occupancy limits the benefit of decoupling
• WP
– Not enough registers, lots of memory spills
– Extremely sensitive to memory latency and the exact scheduling – disturbed by DPSP

