SIMD Lane Decoupling Improved Timing-Error Resilience Evgeni Krimer (UT Austin) Patrick Chiang (Oregon State) Mattan Erez (UT Austin)


Page 1: SIMD Lane Decoupling Improved Timing-Error Resilience

SIMD Lane Decoupling
Improved Timing-Error Resilience

Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)

Page 2: SIMD Lane Decoupling Improved Timing-Error Resilience

All systems power/energy bound
• The good:
  – Transistor still following Moore’s Law
• The bad:
  – Transistor power efficiency improving too slowly
  – Larger fraction of power to non-compute resources
• The conclusion:
  – Better algorithms
  – More efficient architectures
  – Proportionality: waste less of what you have
• This paper: SIMD + timing speculation
  – Efficient architecture + proportional guardbands

SIMD Lane Decoupling (C) M. Erez, E. Krimer

Page 3: SIMD Lane Decoupling Improved Timing-Error Resilience

Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation

Page 4: SIMD Lane Decoupling Improved Timing-Error Resilience

Voltage/timing margins “waste” energy
• Illustrative only – not to scale

[Figure: one clock cycle (“Today”) divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, temperature, …]

Page 5: SIMD Lane Decoupling Improved Timing-Error Resilience

Voltage/timing margins “waste” energy
• Illustrative only – not to scale

[Figure: one clock cycle shown for “Today” and “Future”, each divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, and temperature]

Page 6: SIMD Lane Decoupling Improved Timing-Error Resilience

Timing speculation to the rescue [Ernst04]
• Razor latches
• Speculate low delay
• Detect violations
  – Early/late mismatch
• Recover by stalling
  – Requires fast “global” signal
  – Alternative: flush
• Requires extra ~10% logic
• Path delay restrictions: Δ < t < Δ + cycle
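The early/late mismatch check and the path delay restriction can be illustrated with a small behavioral sketch. It is only a rough model under stated assumptions, not the Razor circuit of [Ernst04]: the main flip-flop is assumed to sample at the clock edge, a shadow latch is assumed to sample Δ later, and the function names `path_delay_ok` and `razor_outcome` are hypothetical.

```python
def path_delay_ok(t, delta, cycle):
    """Path delay restriction from the slide: delta < t < delta + cycle."""
    return delta < t < delta + cycle


def razor_outcome(t, delta, cycle):
    """Classify one path of delay t under a simplified Razor-style model.

    Assumptions (not from the source): the main flip-flop samples at time
    `cycle`, the shadow latch samples at `cycle + delta`, and the shadow
    value is correct whenever the path delay restriction holds.
    """
    if not path_delay_ok(t, delta, cycle):
        return "unsupported"   # violates delta < t < delta + cycle
    if t <= cycle:
        return "ok"            # early and late samples match
    return "detected"          # early/late mismatch: stall or flush


if __name__ == "__main__":
    for delay in (0.1, 0.9, 1.1, 1.3):
        print(delay, razor_outcome(delay, delta=0.2, cycle=1.0))
```

In this model a path shorter than Δ is rejected because new data could reach the shadow latch before it samples, and a path longer than Δ + cycle is rejected because even the late sample would be wrong.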

Page 7: SIMD Lane Decoupling Improved Timing-Error Resilience

Outline
• Setup: SIMD architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation

Page 8: SIMD Lane Decoupling Improved Timing-Error Resilience


SIMD leads to inefficient timing speculation


Page 9: SIMD Lane Decoupling Improved Timing-Error Resilience

SIMD leads to inefficient timing speculation

[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1), for SISD, 16-wide SIMD, and 32-wide SIMD]
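The reason lockstep SIMD suffers more than SISD can be sketched with a first-order model. The assumptions here are mine, not the slides': errors are independent across lanes, an error in any lane stalls every lane, and each stall costs a fixed number of cycles (`stall_cycles`, a hypothetical parameter). The function name is also hypothetical.

```python
def lockstep_throughput(p, lanes, stall_cycles=1):
    """Expected fraction of peak throughput for a lockstep pipeline.

    p            -- probability of a timing error in one stage of one lane
    lanes        -- SIMD width (1 models SISD)
    stall_cycles -- assumed recovery penalty per error event (hypothetical)
    """
    # Probability that at least one lane sees an error in a given cycle.
    p_any = 1.0 - (1.0 - p) ** lanes
    # Each useful cycle is followed, on average, by stall_cycles * p_any
    # recovery cycles during which all lanes wait.
    return 1.0 / (1.0 + stall_cycles * p_any)


if __name__ == "__main__":
    for width in (1, 16, 32):
        row = [round(lockstep_throughput(p / 100, width), 3) for p in range(0, 11, 2)]
        print(width, row)
```

Under these assumptions the probability that some lane errs grows quickly with width, so a wide SIMD unit stalls far more often than a single lane at the same per-lane error probability.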

Page 10: SIMD Lane Decoupling Improved Timing-Error Resilience

Decoupled Parallel SIMD Pipeline (DPSP)
• Shallow FIFO for control (or between stages)

Page 11: SIMD Lane Decoupling Improved Timing-Error Resilience

Decoupled Parallel SIMD Pipeline (DPSP)
• Decoupling mitigates SIMD impact

[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1), for SISD, 32-wide SIMD, and 32-wide DPSP]
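A small Monte Carlo sketch can show why a shallow per-lane FIFO helps. It is not the authors' simulator: it assumes a control unit that issues one instruction per cycle to every lane only when all lane FIFOs have room, lanes that retry for one cycle after an error, and no inter-lane communication. The function name and all parameters are hypothetical.

```python
import random


def simulate_dpsp(p, lanes, fifo_depth, cycles=20000, seed=0):
    """Estimate fraction of peak throughput with per-lane control FIFOs.

    p          -- per-cycle, per-lane error probability
    fifo_depth -- entries per lane FIFO (1 forces near-lockstep behavior)
    """
    rng = random.Random(seed)
    queued = [0] * lanes   # instructions waiting in each lane's FIFO
    done = [0] * lanes     # instructions completed by each lane
    for _ in range(cycles):
        # Control issues to all lanes only if every FIFO has a free entry.
        if max(queued) < fifo_depth:
            queued = [q + 1 for q in queued]
        # Each lane independently executes its oldest queued instruction;
        # on an error it keeps the instruction and retries next cycle.
        for i in range(lanes):
            if queued[i] > 0 and rng.random() >= p:
                queued[i] -= 1
                done[i] += 1
    # Overall progress is bounded by the slowest lane.
    return min(done) / cycles


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        print(depth, round(simulate_dpsp(0.05, 32, depth), 3))
```

With a deeper FIFO a lane that hits an error falls behind instead of stalling its neighbors, and it can catch up while other lanes take their own errors.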

Page 12: SIMD Lane Decoupling Improved Timing-Error Resilience

DPSP challenge 1: inter-lane communication
• Decoupling may delay producer (store)
• Micro barriers
  – Enforce SIMD semantics
• Not a problem in practice with GPUs
  – Execution model requires explicit sync across CTAs / work-groups

Page 13: SIMD Lane Decoupling Improved Timing-Error Resilience

DPSP challenge 2: memory access locality
• Loads and stores no longer aligned
  – Memory “divergence”
• May increase pressure on on-chip memory access
• May impact off-chip access
  – Old NVIDIA hardware had memory coalescing issues
  – No problem with coalescing buffers and caches
• Micro-barriers if problematic
  – Can be done implicitly or explicitly in hardware
  – Sync before every load
  – Prediction

Page 14: SIMD Lane Decoupling Improved Timing-Error Resilience

Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation

Page 15: SIMD Lane Decoupling Improved Timing-Error Resilience

Evaluation flow
• Error Measurements
• Error Probability Model
• Energy-Efficiency Model
• Design Space Exploration
• Arch Sim. Validation

Page 16: SIMD Lane Decoupling Improved Timing-Error Resilience

Measuring error rate
• Inherently circuit and implementation dependent
• Used 3 exemplary circuits
  – SPICE-simulated adder [Ernst04]
  – FPGA-modeled multiplier [Ernst04]
  – Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]

Pawlowski ISSCC’12

Page 17: SIMD Lane Decoupling Improved Timing-Error Resilience

Modeling the error rate function
• 2-parameter model

[Figure: error-rate model with parameters errVmax and Slope, shown for Adder [Ernst04] and Mul. [Ernst04]]

Page 18: SIMD Lane Decoupling Improved Timing-Error Resilience

ET2 energy-efficiency metric
• Energy × (execution Time)²
  – In circuit context: time = delay → ED2
• Isolates architecture efficiency
  – Independent of DVFS
  – Shows improvements in addition to DVFS

$E \propto V_{dd}^2$, $t \propto 1/V_{dd}$
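The claim that ET2 is independent of DVFS follows from the first-order scaling of energy with Vdd² and of delay with 1/Vdd. A minimal numeric check is sketched below; the scaling rule is that first-order approximation only, and the function names are hypothetical.

```python
def et2(energy, time):
    """Energy x (execution time)^2; lower is better."""
    return energy * time ** 2


def scale_vdd(energy, time, k):
    """Apply first-order DVFS scaling by a factor k on Vdd.

    Assumes energy scales with Vdd^2 and delay with 1/Vdd,
    ignoring threshold-voltage and leakage effects.
    """
    return energy * k ** 2, time / k


if __name__ == "__main__":
    base_energy, base_time = 2.0, 3.0
    for k in (0.7, 0.8, 1.0, 1.2):
        e, t = scale_vdd(base_energy, base_time, k)
        print(k, round(e, 3), round(t, 3), round(et2(e, t), 6))
```

Because the k² in energy cancels the 1/k² in time squared, any change in ET2 has to come from the architecture and not from the choice of operating voltage.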

Page 19: SIMD Lane Decoupling Improved Timing-Error Resilience

Simple ET2 model
• Throughput (1/T):
• Relative energy (dynamic and static terms):

Page 20: SIMD Lane Decoupling Improved Timing-Error Resilience

GP-GPU simulation adds some realism
• Baseline uses ideal margins without speculation
  – Only max delay vs. typical delay left on table
  – Timing speculation overhead is 0–15% ET2
• GPGPUSim (version 2.1)
  – Cycle-based extendable GP-GPU simulator from UBC
    • Developer-recommended parameters
• Extended to DPSP
  – Recovery through stall
  – Micro-barrier options
    • Explicit CTA/workgroup synchronization only (no mbarriers)
    • Implicit sync before every memory operation
• Power model based on Hong & Kim, ISCA’10

Page 21: SIMD Lane Decoupling Improved Timing-Error Resilience

Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
  – Design-space exploration
  – Architecture effects

Page 22: SIMD Lane Decoupling Improved Timing-Error Resilience

ET2 vs. SIMD (no spec.)
• DPSP

[Figure: relative ET2 of DPSP plotted over errVmax and Slope, with Adder [Ernst04] and Mul. [Ernst04] marked]

* Relative ET2: lower elevation is better

Page 23: SIMD Lane Decoupling Improved Timing-Error Resilience

DPSP vs. SIMD (w/ spec.)
• SIMD – DPSP

[Figure: ET2 difference plotted over errVmax and Slope, with Adder [Ernst04] and Mul. [Ernst04] marked]

* ET2 difference: higher elevation is better

Page 24: SIMD Lane Decoupling Improved Timing-Error Resilience

Bringing in architecture effects

[Figure panels: Adder; Fabricated MUL]

Page 25: SIMD Lane Decoupling Improved Timing-Error Resilience

Summary
• Design margins are inefficient
• Naive timing speculation with SIMD is inefficient
• DPSP enables efficient speculation in SIMD
  – Microbarriers maintain semantics when necessary
  – With GPU, frequent mbarriers help memory access
• Simple models can capture error response
  – Error rate exponential with Vdd
  – Dependent on circuit and implementation
• Design-space exploration shows potential
  – When and why timing speculation should (not) be used
  – DPSP consistently improves ET2 (10–45%)
  – DPSP achieves 10–20% better ET2 than SIMD w/ spec.

Page 26: SIMD Lane Decoupling Improved Timing-Error Resilience

BACKUP

Page 27: SIMD Lane Decoupling Improved Timing-Error Resilience

Detailed ET2 vs. Vdd behavior

[Figure panels: NN, AES, BFS, MUM]

Page 28: SIMD Lane Decoupling Improved Timing-Error Resilience

Frequent micro-barriers improve ET2

[Figure panels: Adder; Multiplier; Fab.]

Page 29: SIMD Lane Decoupling Improved Timing-Error Resilience

Modeling the error rate function

[Figure: error-rate model with parameters errVmax and Slope, shown for Adder [Ernst04] and Mul. [Ernst04]]

Page 30: SIMD Lane Decoupling Improved Timing-Error Resilience

Proportional margining
• Static margin control
  – Binning
  – Vdd/frequency/biasing adjustment
• Dynamic margin control
  – Vdd/frequency/biasing for slowly varying effects
    • Temperature and aging
  – Clocking tricks
    • From GALS to dynamic and elastic clocking

[Figure: time axis divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, clock skew and jitter, and other]

Page 31: SIMD Lane Decoupling Improved Timing-Error Resilience

Detailed results summary
• BFS
  – High divergence rate
  – Requires implicit synchronizations
  – Limits DPSP opportunities
• CP, DG, RAY
  – Sensitive to memory coalescing
  – Synchronization between memory operations solves it
• MUM
  – Low SIMD occupancy limits the benefit of decoupling
• WP
  – Not enough registers, lots of memory spills
  – Extremely sensitive to memory latency and the exact scheduling, which DPSP disturbs