
Electrical Engineering and Computer Science | | Oregon ......ods, however, may still be much different than the actual silicon under test. For example, the 0.6V DSP in [7] was pre-silicon


Power Reduction Techniques

• Stop the clock
  – Dynamic power reduction
• Power gating
  – Reduce the leakage
• How fast can you turn something on/off?
  – Nothing to do → sleep
• How can you save power while in operation?
  – Near-threshold design


Power Gating

What is the overhead of shutting down, versus just letting it leak?
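One way to frame the overhead question is as a break-even idle time: gating only pays off if the block sleeps long enough for the saved leakage to repay the one-time cost of shutting down and waking up. The sketch below is a minimal illustration under that assumption. The function names and all numbers (10 nJ overhead, 1 mW and 0.05 mW leakage) are hypothetical and do not come from the slides.

```python
def break_even_time(e_overhead_j, p_leak_on_w, p_leak_gated_w):
    """Idle time (s) beyond which power gating saves energy.

    e_overhead_j   : one-time energy to enter and leave sleep (hypothetical)
    p_leak_on_w    : leakage power when the block is left powered
    p_leak_gated_w : residual leakage power when the block is gated
    """
    saved_power = p_leak_on_w - p_leak_gated_w
    if saved_power <= 0:
        raise ValueError("gating must reduce leakage power")
    return e_overhead_j / saved_power


def should_gate(idle_time_s, e_overhead_j, p_leak_on_w, p_leak_gated_w):
    """True if the expected idle period is longer than the break-even time."""
    return idle_time_s > break_even_time(e_overhead_j, p_leak_on_w, p_leak_gated_w)


# Hypothetical block: 10 nJ shutdown/wake-up cost, 1 mW leakage on, 0.05 mW gated.
T_BE = break_even_time(10e-9, 1e-3, 0.05e-3)
print(f"break-even idle time: {T_BE * 1e6:.1f} us")
```

For idle periods shorter than the break-even time, letting the block leak is the cheaper choice in this model.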


Kevin Nowka, IBM


Gate Leakage


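Digital Parallelization

Y[n] = X[n] + αX[n-1]

[Figure: an analog signal is digitized to the input (5 bits @ 5 GS/s, or 8 bits @ 100 MHz). Two registers clocked by clk hold X[n] and X[n-1]; a multiplier by α and an adder produce Y[n]. Clk = 5 GHz. The ANALOG / DIGITAL boundary is marked.]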


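DSP Parallelization

Y[n] = X[n] + αX[n-1]
Y[n-1] = X[n-1] + αX[n-2]

[Figure: the input (5 bits @ 5 GS/s) is sampled at CLK = 5 GHz on clk and clkb, then two parallel multiply-add paths clocked at CLK = 2.5 GHz compute Y[n] and Y[n-1] from X[n], X[n-1], and X[n-2], each with its own α multiplier and adder.]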


DSP Parallelization

• Clock speed reduced by ½
  – Can parallelize further
  – Increase number of MACs (multiply/accumulates) by 2
• Intuition?
  – Area goes up by 2
  – Power decreases (clock rate down by 2, computations up by 2, but easier timing constraints)
  – What about clock power?
• Save a little power, but double the area?
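The two-way parallel filter can be checked functionally with a short sketch. It assumes X[-1] = 0 and an even number of samples; the function names are illustrative only. Each iteration of the parallel loop stands for one half-rate clock cycle in which two MACs produce Y[n-1] and Y[n] together.

```python
def fir_serial(x, alpha):
    """Reference filter: y[n] = x[n] + alpha * x[n-1], with x[-1] taken as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample + alpha * prev)
        prev = sample
    return y


def fir_two_way(x, alpha):
    """Two-lane version; each loop iteration is one half-rate clock cycle.

    Lane A computes y[n-1] = x[n-1] + alpha * x[n-2]
    Lane B computes y[n]   = x[n]   + alpha * x[n-1]
    """
    if len(x) % 2:
        raise ValueError("sketch assumes an even number of samples")
    y = []
    x_nm2 = 0.0  # x[n-2], carried over from the previous half-rate cycle
    for i in range(0, len(x), 2):
        x_nm1, x_n = x[i], x[i + 1]
        y.append(x_nm1 + alpha * x_nm2)  # lane A (first MAC)
        y.append(x_n + alpha * x_nm1)    # lane B (second MAC)
        x_nm2 = x_n
    return y


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(fir_two_way(samples, 0.5) == fir_serial(samples, 0.5))
```

The output sequence is identical, while the number of multiply-add units doubles and each unit gets twice as long to finish, which is the slack that easier timing constraints refer to.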


Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

• http://www.eecs.umich.edu/~taustin/papers/MICRO36-Razor.pdf


Near-VT Operation

[Figure: "Delay vs. Vdd", quantizer delay (s, 10^-10 to 10^-4) against supply voltage (0 to 1 V).]

[Figure: "Energy/Conv-step vs. Vdd", energy (J, 10^-16 to 10^-13) against supply voltage (0 to 1 V), with total, dynamic, and static curves.]

Low-Speed → Large Circuit Parallelism, 1:N Multiplexing → High Energy-Efficiency


Near-VT Design Margins

• Significant energy/performance lost to guardbands

[Figure: clock-period budget along a time axis, drawn for "Today" and "Future". Each bar starts at the typical logic delay and adds guardbands up to the maximum logic delay: clock skew and jitter, process variation guard-band, wearout guard-band, and noise guard-band.]


Near-VT: The Variability Challenge

Motivating Example: Medical Electronics. Power consumption critically constrains medical devices, as the battery typically limits the system cost, safety, and lifetime [12]. For example, my collaborator Professor B. Chi at Tsinghua University (China) and I designed a swallowable wireless endoscope capsule [10,11]. Each endoscope System-on-a-Chip (SoC) incorporates a large-arrayed CMOS image sensor, an on-die digital signal processor, and a low-data-rate RF wireless transceiver. Unfortunately, the RF transceiver dissipates more than 90% of the total power budget. Because image sensing algorithms are highly parallel, high throughput computing performed locally within the endoscope itself would significantly reduce the transmitted RF downlink bandwidth and hence significantly improve battery life. Alternatively, it would enable us to partition more power to the 2D-data acquisition in the image-sensing analog front-end.

In a more exotic, energy-constrained medical application, next-generation brain implants exploit arrays of hundreds of electrodes [20, 21] to analyze nerve potentials, enabling new scientific understanding and possible neurological treatments. Limits on battery capacity and heat generation [21,57] restrict brain implants to low power levels of several mW, but the implants still require significant computational capabilities to reduce the wireless transmission power [22, 23]. Recent work [56] suggests that with 1024 neural channels, local on-chip DSP algorithms such as feature extraction and clustering can improve total system power by factors of 4.3 and 26, respectively. In addition to the large amount of local DSP processing, future medical devices will also require large computational throughput for performing data encryption, compression, and wireless baseband processing.

[Figure panels: delay [s] vs. Vdd [mV] over 400 to 650 mV (min, max, avg, std), and a second delay [s] panel.]

[Figure panel: energy/inst (pJ) and frequency (kHz) vs. Vdd (V) over 0.20 to 0.50 V. Annotated point: Vdd = 350 mV, 3.52 pJ/inst, 354 kHz.]

Figure 3: Subliminal processor frequency and energy breakdowns at various supply voltages.

increase in a near-exponential fashion. This rise in leakage energy eventually dominates any reduction in switching energy, creating an energy minimum seen in Figure 2.

The identification of an energy minimum has led to interest in processors that operate at this energy optimal supply voltage [12,14,17] (referred to as Vmin and typically 250mV-350mV). However, the energy minimum is relatively shallow. Energy typically reduces by only ~2X when Vdd is scaled from the near-threshold regime (400-500mV) to the subthreshold regime, though delay rises by 50-100X over the same region. While acceptable in ultra-low energy sensor-based systems, this delay penalty is not tolerable in a broad set of applications. Hence, although introduced roughly 30 years ago, ultra-low voltage design remains confined to a small set of markets with little or no impact on mainstream semiconductor products.
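The origin of the energy minimum can be illustrated with a toy model; it is a sketch under stated assumptions, not the model used in the cited work. Switching energy scales as Vdd², while leakage energy per operation is the off-current times Vdd times the cycle time. Since the cycle time scales as C·Vdd/I_on, the leakage term scales as Vdd²·I_off/I_on. The sketch assumes an exponential on/off current ratio exp(Vdd / slope_v) at every voltage, a hypothetical slope of 39 mV, a hypothetical lumped factor k_leak = 1000 (logic depth times the ratio of idle to switching gates), and a 0.15 V floor standing in for the lowest functional supply.

```python
import math


def energy_per_op(vdd, k_leak=1000.0, slope_v=0.039):
    """Normalized energy per operation (E / C_eff) in a toy model.

    Returns (total, dynamic, leakage). k_leak and slope_v are hypothetical.
    """
    dynamic = vdd ** 2
    # Leakage integrates over a cycle time that grows exponentially as vdd drops.
    leakage = dynamic * k_leak * math.exp(-vdd / slope_v)
    return dynamic + leakage, dynamic, leakage


def find_vmin(v_lo=0.15, v_hi=1.0, steps=851):
    """Grid search for the supply voltage that minimizes total energy."""
    grid = [v_lo + i * (v_hi - v_lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda v: energy_per_op(v)[0])


V_MIN = find_vmin()
print(f"toy-model Vmin = {V_MIN * 1000:.0f} mV")
```

With these assumed parameters the minimum lands inside the 250mV-350mV range quoted above. Above the minimum the Vdd² term dominates, and below it the exponentially growing cycle time makes leakage dominate.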

3. NTC Analysis: Recent work at many leading institutions has produced working processors that operate at subthreshold voltages. For instance, the Subliminal processor [17] designed by Hanson et al. provides the opportunity to clearly quantify the NTC region and how it compares to the subthreshold region. Figure 3 presents the energy breakdown of the design as well as the operating frequency achieved across a range of voltages. As was discussed in Section 2, there is a Vmin operating point that occurs in the subthreshold operating region but is tied to operating points of less than 1MHz. On the other hand, only a modest increase in energy is seen operating at the NTC region (around 0.5V), while frequency characteristics at that point are significantly better. At the nominal operating point Subliminal operates at 20.5 MHz and 33.1 pJ/inst, so the NTC operating point shows approximately a 6.6x reduction in energy and an 11.4x reduction in frequency.
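The quoted ratios imply an absolute NTC operating point, which the short calculation below makes explicit. It uses only numbers stated in the text and in the Figure 3 annotation (350 mV, 3.52 pJ/inst, 354 kHz); treating that annotated point as the subthreshold reference is an assumption.

```python
# Nominal operating point and reduction factors quoted in the text.
NOMINAL_FREQ_MHZ = 20.5
NOMINAL_ENERGY_PJ = 33.1
ENERGY_REDUCTION = 6.6
FREQ_REDUCTION = 11.4

# Implied NTC operating point.
ntc_energy_pj = NOMINAL_ENERGY_PJ / ENERGY_REDUCTION
ntc_freq_mhz = NOMINAL_FREQ_MHZ / FREQ_REDUCTION

# Annotated subthreshold point from Figure 3 (assumed reference for comparison).
SUBVT_ENERGY_PJ = 3.52
SUBVT_FREQ_MHZ = 0.354

print(f"NTC: {ntc_energy_pj:.2f} pJ/inst at {ntc_freq_mhz:.2f} MHz")
print(f"vs. 350 mV point: {ntc_energy_pj / SUBVT_ENERGY_PJ:.2f}x energy, "
      f"{ntc_freq_mhz / SUBVT_FREQ_MHZ:.2f}x frequency")
```

Under these numbers the NTC point costs roughly 5 pJ/inst at about 1.8 MHz: a modest energy increase over the 350 mV point in exchange for a several-fold gain in frequency.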


4. NTC Barriers: Although NTC provides excellent energy-frequency tradeoffs, it doesn’t come without its own set of complications. NTC faces three key barriers that must be overcome for widespread use: performance loss, performance variation, and functional failure. In the following subsections we discuss why each of these exists and why they pose problems to the widespread adoption of NTC.

4.1. Performance loss. The performance loss observed in NTC, while not as severe as that in subthreshold operation, poses one of the most formidable challenges for NTC viability. In an industrial 45nm technology the fanout-of-four inverter (FO4) delay at 400mV is 10X slower than at the nominal 1.1V. There have been several recent advances in architectural and circuit techniques that can be used to improve performance in the NTC regime. These techniques, described in detail in Section 5.1, center around aggressive parallelism with a novel NTC-oriented memory/computation hierarchy. The increased communication needs of these architectures are supported by the application of 3D chip integration, as made feasible by the low power density of NTC circuits. In addition a new

Figure 2: Energy and delay in different supply voltage operating regions.

Figure 1: (a) Effect of lowered Vdd on the energy consumed/computation [13] and logic delay in 0.13um-CMOS; (b) Box plot of the delay uncertainty across Monte Carlo variations while scaling supply voltage, and for varying input values at Vdd=0.5V in 90nm-CMOS

1.2 Insight: Near-threshold operation saves energy, but has reliability limitations

In all of these examples, traditional low-power techniques cannot achieve the necessary efficiency while still achieving the required performance. The microelectronics community has previously shown that sub- or near-threshold circuits achieve optimal energy and power performance where the dynamic energy and leakage energy dissipated per cycle are balanced [31]. This approach can improve energy efficiency by 10x to 20x, but only while running at very low frequencies of 0.1-10MHz, depending on the complexity of the circuit (Figure 1a). Furthermore, near-threshold designs suffer from very high process variability that can severely degrade performance and efficiency by


16b-CLA, 90nm-CMOS, 0.5V

In the NTV regime, it is extremely difficult to ensure that these timing constraints will be met in the presence of large process variations. Characterizing an entire processor and its worst-case delay path under Monte Carlo is usually out of the question. In [7], a Monte Carlo approximation is proposed to ensure correct timing under NTV variations. These previous timing characterization methods, however, may still be much different than the actual silicon under test. For example, the 0.6V DSP in [7] was pre-silicon characterized at 14MHz after variations and aging margins, but after fabrication, was able to successfully operate at 43.4MHz. With continued process scaling, guaranteeing post-fabricated operation that correlates well with pre-silicon simulation is becoming more challenging, especially for NTV operation.

3.2 Error Detection

A different solution to combating variations is in-situ error detection. After digital circuits are synthesized, an error detection method is employed that detects if the timing constraints are met during normal chip operation. Therefore, the design is not required to undergo simulation of every possible variation-induced timing error that might occur under NTV.

The most common error detection method is Razor [6], or related timing speculative approaches [2]. With Razor, each datapath flip-flop is modified similar to the circuit shown in Figure 1. Each flip-flop is compared with the value stored in a ’shadow’ latch that is sampled by a delayed clock, typically around 20% after the main clock edge. If the outputs have not settled to their final result before the main clock edge, they will be caught in this delayed latch afterward. After XOR comparison, if the two values are different, an error is flagged.

Figure 1: Razor flip-flops with global error flag
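The detection step can be sketched behaviorally for a single bit. This is a simplified model under stated assumptions, not the Razor circuit itself: the function name and arguments are hypothetical, the data input is assumed to switch once from a stale value to its final value at `arrival_time`, and the shadow latch samples at `clock_period * (1 + window)`.

```python
def razor_sample(arrival_time, clock_period, final_value, stale_value, window=0.2):
    """Model one Razor flip-flop for one cycle (behavioral sketch).

    arrival_time : time at which the data input settles to final_value
    window       : shadow-latch delay as a fraction of the clock period
    Returns (main_value, shadow_value, error_flag).
    """
    shadow_edge = clock_period * (1.0 + window)
    if arrival_time > shadow_edge:
        # Outside the speculation window the shadow latch is wrong as well,
        # so the comparison cannot be trusted.
        raise ValueError("data settles after the shadow latch samples")
    # The main flip-flop captures stale data if the input is late.
    main_value = final_value if arrival_time <= clock_period else stale_value
    shadow_value = final_value
    error_flag = main_value != shadow_value  # XOR comparison for a single bit
    return main_value, shadow_value, error_flag


print(razor_sample(0.9, 1.0, 1, 0))  # on time: no error
print(razor_sample(1.1, 1.0, 1, 0))  # late but inside the window: error flagged
```

A late transition is flagged only when it actually changes the captured value, and anything later than the window is undetectable in this model, which is why the window size bounds the usable speculation.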

These errors are then managed by an error recovery mechanism to ensure that all instructions are executed correctly after an error. There are many different recovery methods, with different recovery speeds and energy overheads [3]. The clearest advantage of Razor is the potential for approximately a 20% speed improvement. Using this method allows for operation faster than the worst-case STA, improving both throughput and energy.

However, Razor circuits do exhibit several disadvantages in the NTV regime. First off, the addition of an extra set of latches to the datapath increases energy, significantly impacting any energy improvements due to increased clock speed. Second, in order to guarantee that the 20% window can correctly catch all errors, minimum delay buffers must be added to all min-delay fast paths in the logic, in order to eliminate race conditions that prevent Razor from operating correctly. These buffers add to the energy overhead of Razor as well. Hence, while it is common to apply Razor and min-delay path insertion into only a subset of the logic paths at super-threshold, at near-threshold, this may not guarantee correct operation due to the unknown delay distribution. The final disadvantage with Razor, specifically in the NTV regime, is its inability to improve throughput beyond approximately 20%, limited by the min-path race conditions.¹

[Figure panels: (a) number of occurrences vs. adder delay (5 to 35 ns), with the chosen Monte Carlo case marked; (b) number of occurrences vs. adder delay (5 to 35 ns), with error-detection speeds marked.]

Figure 2: Histograms of (a) Monte Carlo chip-to-chip delay of the STA and (b) delay of changing FIR filter data on a 16-bit adder with error-detection speeds marked.

4. VARIATION STUDY

In order to explore the effects of process variations in the NTV regime, a 16-bit Carry-Lookahead (CLA) adder was synthesized in a 90nm IBM CMOS process. Figure 2a shows the histogram of a 500-point Monte-Carlo simulation of the CLA adder, where the inputs are the worst-case static timing analysis (STA) vectors determined by the synthesizer. The figure shows a large standard deviation in delay while operating at a NTV voltage of 500mV. In addition to process variation-induced timing uncertainty, there is input vector variation. Figure 2b shows the simulated delays of the same adder for one particular Monte-Carlo simulation, where the input vectors are supplied from the outputs of a FIR filter. Further, these vector-to-vector timing variations worsen as the circuits are driven farther into the NTV regime.

Figure 3 shows the potential speedup that can be achieved with the ability to ideally detect errors. For this simulation, we chose the worst-performing Monte Carlo case from the 500 that were simulated, resulting in a worst-case clock speed of 32MHz (as opposed to a best-case speed of 166MHz). Next, 1000 add-vectors from a low-pass FIR filter using electroencephalography (EEG) data were extracted from a Matlab simulation, and then simulated with the 16-bit carry-

¹ The theoretical maximum delay window for Razor is 50%. Unfortunately, NTV operation may exhibit delays that span much more than 50%.

500-point Monte-Carlo

Worst-Case STA

Single Worst Monte Carlo

Random Input Vectors


Razor Timing Speculation [Ernst 04]

• Popular mechanism to ‘speculate’ on logic completion
• Detects timing violations (typ: ~20%)
• Requires extra logic
  – Inverter buffering of min-delay paths


Conventional Razor Scalar Processors

• Architecture:
  – Typically optimized for general-purpose control
    • Single instruction/cycle
    • Deep pipelining for ILP
    • Global ERROR signal
• Razor Circuits:
  – Limited timing speculation
    • Min-path race can be erroneous
    • Avg. speculation ~ 20-25%
    • Max. speculation = 50%

ARM-Michigan, ISSCC-2010

J. Low Power Electron. Appl. 2011, 1 342

3.1. Clock Gating

Clock Gating is conceptually the simplest of all error recovery methods to implement. Its original purpose was saving power on unused blocks at the system level by not clocking them when they are not used. This technique can also be adapted to error recovery by pausing all pipeline stages while waiting for the slow stage to either finish computation or to allow for the instruction to be re-executed. The pausing action ensures that later instructions do not continue to their next pipeline stage until the errant instruction is corrected. It is most commonly paired with Razor flip-flops, as it only works if the pipeline can be stalled before the next clock edge, before the pipeline registers are set to get new data, which can be achieved in slow systems. The Clock Gating concept is illustrated in Figure 7.

Figure 7. (a) Pipeline modification for Clock Gating error recovery method; (b) Clock Gating pipeline data path with errors.

[Figure 7(a): pipeline stages IF, ID, EX, MEM, ST, WB separated by flip-flops (FF); labels: clk, error, PC.]

[Figure 7(b): instruction occupancy per pipeline stage over cycles 1 to 25, with stall slots marked ST. Annotated events: error in EX of Inst. 4; error in ID of Inst. 11; error in 2 stages at once.]

IF I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22
ID I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I11 I12 I13 I14 I15 I16 I17 I18 I19 I20 I21
EX I1 I2 I3 I4 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I17 I18 I19 I20
MEM I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I16 I17 I18 I19
ST I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17 I18
WB I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 I16 I17

The primary advantage of this method is that it requires very few architectural changes as well as minimal area addition to a design compared with other methods. However, in order for this method to work properly, a stall signal needs to propagate to all pipeline stages in a very short amount of time (50% of one clock cycle when Razor circuits are used). This can be difficult to achieve across large CMOS dies where pipeline stages are several millimeters apart. Furthermore, this is completely impractical to implement in complicated microprocessors because it may take several clock cycles just to propagate the clock signal through a clock distribution network, which cannot be halted in only one cycle.

3.2. Counterflow Pipelining

Traditional Counterflow Pipelining is a microarchitecture technique that uses a bidirectional pipeline, allowing instructions to flow forward and results to flow backward. This technique made it easier to implement operand forwarding, register renaming, and most importantly pipeline flushing [13,17]. In order

VDD 0.9-1.1V fCLK = 1GHz in 65nm


Data-Parallel SIMD Architectures

• SIMD: single-instruction, multiple-data
• Multiple execution units share sequencer and control path
• Sharing control path amortizes its energy overhead among lanes
• GPU is a popular example of SIMD


Timing Errors in SIMD Architecture

STALL!


Error Impact on SIMD Architecture

[Figure: fraction of peak throughput (0.5 to 1) vs. probability of an error in a single stage, single lane (0 to 0.1), for SISD, 16-wide SIMD, and 32-wide SIMD.]
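Why wider SIMD suffers more from timing errors can be illustrated with a simple probability model. This is a sketch under stated assumptions and is not claimed to be the model behind the plotted curves: errors are taken as independent across lanes and stages, any single error stalls every lane, and each stall costs a fixed number of cycles. The function name and the `stages` and `stall_cycles` parameters are hypothetical.

```python
def throughput_fraction(p_error, lanes, stages=1, stall_cycles=1):
    """Fraction of peak throughput for a lock-step SIMD pipeline.

    p_error      : probability of an error in a single stage of a single lane
    lanes        : number of lanes that share one stall signal
    stages       : pipeline stages per lane that can raise an error
    stall_cycles : cycles lost by all lanes per error event
    """
    # Probability that at least one stage in at least one lane errs this cycle.
    p_any = 1.0 - (1.0 - p_error) ** (lanes * stages)
    # Expected cycles per issued instruction is 1 + stall_cycles * p_any.
    return 1.0 / (1.0 + stall_cycles * p_any)


for width in (1, 16, 32):
    print(width, round(throughput_fraction(0.02, width), 3))
```

Because the stall is global, the chance of losing a cycle grows with the number of lanes even when the per-lane error rate is unchanged.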


Proposed 10-lane SIMD Architecture

[Figure: Synctium pipeline. Stages: Instruction Fetch (1 cycle), IQ Weave / RF Access (1 cycle), Execute (2 cycles), Decoupling Queues (1 cycle), WB / ALU/RF Weave (1 cycle). An Instruction Queue (entries 0 to 127) and IQ Control, with Stall IQ, Stall[9:0], and Bar[9:0] signals, feed lanes 0 to 9. Each lane has a decoupling queue (Stall DQ, Next Inst, Full, Bar, Pass), a register file (RF), an ALU, 3:1 multiplexers selecting among neighboring IQ_out, RF_out, and ALU_out signals, an ALU_wb output, and RAZOR flip-flops. Example queue contents include lw (barrier), umul $r1, $r2, msub $r2, $r3, xor $r7, $r4, and add $r9, $r7.]


Decoupled Parallel SIMD Pipeline

STALL!

Decoupling synchronous ‘async’ between lanes!


Decoupled Parallel SIMD Pipeline (DPSP)

[Figure: fraction of peak throughput (0.5 to 1) vs. probability of an error in a single stage, single lane (0 to 0.1), for SISD, 32-wide SIMD, and 32-wide DPSP.]


Delay Characterization

[Figure: delay (ns) vs. VDD (V) at 0.53, 0.6, 0.7, 0.8, and 1.0 V. Inset "Variation vs. Sample Size (Lane 1, 0.53V)": delay (ns) for 10-point and 100-point samples.]


Lane Weaving
- Eliminating lanes improves total throughput

[Figure: max operating frequency (MHz, 50 to 500) per lane (0 to 9) at VDD = 0.53V, 0.6V, 0.7V, 0.8V, and 1.0V, annotated with frequency Δ values of 35%, 6%, 2%, 9%, and 8%.]

[Figure: throughput (GOPS, 0 to 5) vs. VDD (0.53, 0.60, 0.70, 0.80, 1.0 V) for 10 lanes with 0 spares, 9 lanes with 1 spare, and 8 lanes with 2 spares. Bar labels: 4.35, 3.56, 3.96; 3.25, 2.76, 2.99; 2.53, 2.18, 2.32; 1.55, 1.36, 1.46; 0.85, 0.92, 0.86.]
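The claim that eliminating lanes can raise total throughput follows from all active lanes sharing one clock, which the slowest active lane sets. The sketch below assumes one operation per active lane per cycle and uses hypothetical per-lane frequencies with one slow outlier; it is an illustration of the trade-off, not the selection logic of the actual chip.

```python
def best_lane_config(lane_fmax_mhz, max_spares):
    """Pick how many of the slowest lanes to disable to maximize throughput.

    lane_fmax_mhz : maximum operating frequency of each lane (MHz)
    max_spares    : number of lanes that may be disabled
    Returns (throughput_mops, active_lanes, clock_mhz).
    """
    ranked = sorted(lane_fmax_mhz, reverse=True)  # fastest lanes first
    best = None
    for spares in range(max_spares + 1):
        active = ranked[: len(ranked) - spares]
        if not active:
            break
        clock = active[-1]  # shared clock is limited by the slowest active lane
        throughput = len(active) * clock  # one operation per lane per cycle
        if best is None or throughput > best[0]:
            best = (throughput, len(active), clock)
    return best


# Hypothetical lane frequencies (MHz) with a single slow lane.
lanes = [100, 98, 97, 96, 95, 95, 94, 93, 92, 60]
print(best_lane_config(lanes, max_spares=2))
```

With these made-up numbers, dropping the one slow lane lets the other nine run much faster, while dropping a second lane loses more throughput than the small extra clock gain returns.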


[Figure: peak performance (OP†/sec, 10^3 to 10^12) vs. power (W, 10^-9 to 10^2). Labeled designs: Imagine, SPI Storm 1, Subliminal, Phoenix, SODA (90nm), AnySP‡, TILE64, Intel i5-67, ARM Cortex A9, ATI HD 5970, Intel SIMD accelerator (Scale VT), and TI LP DSP, plus points at VDD=1V, VDD=0.6V, and VDD=0.53V. Regions: Sub-Threshold (Low-Performance), Near-Threshold (Medium-Performance), Super-Threshold (High-Performance).]

1st Near-VT Parallel-SIMD Processor with Variation Resiliency