Electrical Engineering and Computer Science | Oregon ......

Power Reduction Techniques

•  Stop the clock – Dynamic power reduction

•  Power gating – Reduce the leakage

•  How fast can you turn something on/off? – Nothing to do → sleep

•  How can you save power while in operation? – Near-threshold design

What is the overhead of shutting down? Versus just letting it leak?

Power Gating

Kevin Nowka, IBM

Gate Leakage

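Digital Parallelization: Y[n] = X[n] + αX[n-1]

[Block diagram: an analog signal crosses the ANALOG | DIGITAL boundary as the input (5 bits @ 5 GS/s, or 8 bits @ 100 MHz). Two registers clocked at Clk = 5 GHz hold X[n] and X[n-1]; a multiplier scales X[n-1] by α and an adder produces Y[n].]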

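DSP Parallelization: Y[n] = X[n] + αX[n-1]

[Block diagram: the input, sampled at CLK = 5 GHz, is split on clk and clkb into two multiply-add branches running at CLK = 2.5 GHz. One branch computes Y[n] = X[n] + αX[n-1]; the other computes Y[n-1] = X[n-1] + αX[n-2].]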

DSP Parallelization

•  Clock speed reduced by ½
   –  Can parallelize further
   –  Increase number of MACs (multiply/accumulates) by 2

•  Intuition?
   –  Area goes up by 2
   –  Power decreases (clock rate down by 2, computations up by 2, but easier timing constraints)
   –  What about clock power?

•  Save a little power, but double the area?
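The two-branch idea on these slides can be checked with a small behavioral sketch. It assumes X[-1] = 0 and an even number of samples; the function names and the first-order power formula P = branches · C · Vdd² · f are illustrative assumptions, not something taken from the slides, and clock-tree power is ignored.

```python
def fir_serial(x, alpha):
    """Reference filter at the full sample rate: Y[n] = X[n] + alpha * X[n-1]."""
    y, prev = [], 0.0                       # assumption: X[-1] = 0
    for sample in x:
        y.append(sample + alpha * prev)
        prev = sample
    return y


def fir_two_way(x, alpha):
    """Two multiply-add branches, each producing one output per half-rate cycle."""
    if len(x) % 2:
        raise ValueError("expects an even number of samples")
    y = [0.0] * len(x)
    for n in range(1, len(x), 2):           # one iteration = one slow clock cycle
        x_nm2 = x[n - 2] if n >= 2 else 0.0  # assumption: X[-1] = 0
        y[n] = x[n] + alpha * x[n - 1]       # branch A: Y[n]   = X[n]   + aX[n-1]
        y[n - 1] = x[n - 1] + alpha * x_nm2  # branch B: Y[n-1] = X[n-1] + aX[n-2]
    return y


def dynamic_power(branches, c_eff, vdd, f_clk):
    """First-order switching power (hypothetical model, clock tree ignored)."""
    return branches * c_eff * vdd ** 2 * f_clk
```

At the same supply voltage, two branches at half the clock switch as much capacitance per second as one branch at full rate, so in this model any saving has to come from the relaxed timing letting Vdd drop, which matches the slide's "save a little power, but double the area" intuition.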

Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

•  http://www.eecs.umich.edu/~taustin/papers/MICRO36-Razor.pdf

[Plot: Delay vs. Vdd. Quantizer delay (s), log scale from 10^-10 to 10^-4, against supply voltage (V) from 0 to 1.]

[Plot: Energy/Conv-step vs. Vdd. Energy (J), log scale from 10^-16 to 10^-13, against supply voltage (V) from 0 to 1, with total, dynamic, and static curves.]

Low-Speed Large Circuit Parallelism

1:N Multiplexing High Energy-Efficiency

Near-VT Operation

Near-VT Design Margins

•  Significant energy/performance lost to guardbands

[Figure: stacked delay budgets over time, "Today" vs. "Future". Each stack shows typical logic delay, maximum logic delay, and guardbands for noise, wearout, process variation, and clock skew and jitter.]

Near-VT: The Variability Challenge

Motivating Example: Medical Electronics. Power consumption critically constrains medical devices, as the battery typically limits the system cost, safety, and lifetime [12]. For example, my collaborator Professor B. Chi at Tsinghua University (China) and I designed a swallowable wireless endoscope capsule [10,11]. Each endoscope System-on-a-Chip (SoC) incorporates a large-arrayed CMOS image sensor, an on-die digital signal processor, and a low-data-rate RF wireless transceiver. Unfortunately, the RF transceiver dissipates more than 90% of the total power budget. Because image sensing algorithms are highly parallel, high throughput computing performed locally within the endoscope itself would significantly reduce the transmitted RF downlink bandwidth and hence significantly improve battery life. Alternatively, it would enable us to partition more power to the 2D-data acquisition in the image-sensing analog front-end.

In a more exotic, energy-constrained medical application, next-generation brain implants exploit arrays of hundreds of electrodes [20, 21] to analyze nerve potentials, enabling new scientific understanding and possible neurological treatments. Limits on battery capacity and heat generation [21,57] restrict brain implants to low power levels of several mW, but the implants still require significant computational capabilities to reduce the wireless transmission power [22, 23]. Recent work [56] suggests that with 1024 neural channels, local on-chip DSP algorithms such as feature extraction and clustering can improve total system power by factors of 4.3 and 26, respectively. In addition to the large amount of local DSP processing, future medical devices will also require large computational throughput for performing data encryption, compression, and wireless baseband processing.

[Plots: delay [s] vs. Vdd [mV] from 400 to 650 mV, with min, max, avg, and std series.]

[Plot: energy/inst (pJ) and frequency (kHz) vs. Vdd (V) from 0.20 to 0.50. Annotated point: Vdd = 350 mV, 3.52 pJ/inst, 354 kHz.]

Figure 3: Subliminal processor frequency and energy breakdowns at various supply voltages.

increase in a near-exponential fashion. This rise in leakage energy eventually dominates any reduction in switching energy, creating an energy minimum seen in Figure 2.

The identification of an energy minimum has led to interest in processors that operate at this energy optimal supply voltage [12,14,17] (referred to as Vmin and typically 250mV-350mV). However, the energy minimum is relatively shallow. Energy typically reduces by only ~2X when Vdd is scaled from the near-threshold regime (400-500mV) to the subthreshold regime, though delay rises by 50-100X over the same region. While acceptable in ultra-low energy sensor-based systems, this delay penalty is not tolerable in a broad set of applications. Hence, although introduced roughly 30 years ago, ultra-low voltage design remains confined to a small set of markets with little or no impact on mainstream semiconductor products.
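A toy model can make the origin of the energy minimum concrete. The sketch below assumes switching energy scales as C·Vdd² and that leakage energy per operation grows exponentially as Vdd drops, because each operation takes exponentially longer. All constants (`c_eff`, `leak_scale`, `slope`) are made-up illustrative values in arbitrary units, not data from the excerpt.

```python
import math


def energy_per_op(vdd, c_eff=1.0, leak_scale=200.0, slope=0.039):
    """Return (dynamic, leakage) energy per operation in arbitrary units.

    Hypothetical model: dynamic ~ C * Vdd^2; leakage ~ Vdd * cycle time,
    with cycle time growing as exp(-Vdd / slope) when Vdd is lowered.
    """
    dynamic = c_eff * vdd ** 2
    leakage = leak_scale * vdd * math.exp(-vdd / slope)
    return dynamic, leakage


def total_energy(vdd):
    """Total energy per operation under the toy model."""
    return sum(energy_per_op(vdd))


def find_vmin(v_lo=0.15, v_hi=1.0, steps=851):
    """Grid-search the supply voltage that minimizes total energy."""
    grid = [v_lo + (v_hi - v_lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=total_energy)
```

With these constants the minimum falls inside the sweep rather than at either end: below it the leakage term takes over, above it the Vdd² term does. Real values depend on the process and the circuit's activity factor.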

3. NTC Analysis: Recent work at many leading institutions has produced working processors that operate at subthreshold voltages. For instance, the Subliminal processor [17] designed by Hanson et al. provides the opportunity to clearly quantify the NTC region and how it compares to the subthreshold region. Figure 3 presents the energy breakdown of the design as well as the operating frequency achieved across a range of voltages. As was discussed in Section 2, there is a Vmin operating point that occurs in the subthreshold operating region but is tied to operating points of less than 1MHz. On the other hand, only a modest increase in energy is seen operating at the NTC region (around 0.5V), while frequency characteristics at that point are significantly better. At nominal operating points Subliminal operates at 20.5 MHz and 33.1 pJ/inst, showing approximately a 6.6x reduction in energy and an 11.4x reduction in frequency at the NTC operating point.
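The quoted ratios can be turned into an approximate NTC operating point with a few lines of arithmetic. The nominal figures and reduction factors come from the paragraph above; treating the annotated 350 mV point of the Figure 3 plot (3.52 pJ/inst, 354 kHz) as the Vmin reference is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the text for the Subliminal processor.
nominal_freq_mhz = 20.5
nominal_energy_pj = 33.1
energy_reduction = 6.6     # nominal -> NTC
freq_reduction = 11.4      # nominal -> NTC

# Assumed Vmin reference: the annotated 350 mV point in the Figure 3 plot.
vmin_energy_pj = 3.52
vmin_freq_mhz = 0.354

# Implied NTC operating point (around 0.5 V).
ntc_energy_pj = nominal_energy_pj / energy_reduction
ntc_freq_mhz = nominal_freq_mhz / freq_reduction

# How NTC compares with the assumed Vmin point.
energy_cost_vs_vmin = ntc_energy_pj / vmin_energy_pj
speedup_vs_vmin = ntc_freq_mhz / vmin_freq_mhz

print(f"NTC: {ntc_energy_pj:.2f} pJ/inst at {ntc_freq_mhz:.2f} MHz")
print(f"vs. Vmin: {energy_cost_vs_vmin:.2f}x energy, {speedup_vs_vmin:.2f}x frequency")
```

This gives roughly 5 pJ/inst at about 1.8 MHz for the NTC point, which is the "modest increase in energy" for a several-fold frequency gain that the paragraph describes.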

4. NTC Barriers: Although NTC provides excellent energy-frequency tradeoffs, it does not come without its own set of complications. NTC faces three key barriers that must be overcome for widespread use: performance loss, performance variation, and functional failure. In the following subsections we discuss why each of these exists and why it poses a problem for the widespread adoption of NTC.

4.1. Performance loss. The performance loss observed in NTC, while not as severe as that in subthreshold operation, poses one of the most formidable challenges for NTC viability. In an industrial 45nm technology the fanout-of-four inverter (FO4) delay at 400mV is 10X slower than at the nominal 1.1V. There have been several recent advances in architectural and circuit techniques that can be used to improve performance in the NTC regime. These techniques, described in detail in Section 5.1, center around aggressive parallelism with a novel NTC oriented memory/computation hierarchy. The increased communication needs in these architectures are supported by the application of 3D chip integration, as made feasible by the low power density of NTC circuits. In addition a new

Figure 2: Energy and delay in different supply voltage operating regions.

Figure 1: (a) Effect of lowered Vdd on the energy consumed/computation [13] and logic delay in 0.13um-CMOS; (b) Box plot of the delay uncertainty across Monte Carlo variations while scaling supply voltage, and for varying input values at Vdd=0.5V in 90nm-CMOS

1.2 Insight: Near-threshold operation saves energy, but has reliability limitations

In all of these examples, traditional low-power techniques cannot achieve the necessary efficiency while still achieving the required performance. The microelectronics community has previously shown that sub- or near-threshold circuits achieve optimal energy and power performance where the dynamic energy and leakage energy dissipated per cycle are balanced [31]. This approach can improve energy efficiency by 10x to 20x, but only while running at very low frequencies of 0.1-10MHz, depending on the complexity of the circuit (Figure 1a). Furthermore, near-threshold designs suffer from very high process variability that can severely degrade performance and efficiency by


16b-CLA, 90nm-CMOS, 0.5V

In the NTV regime, it is extremely difficult to ensure that these timing constraints will be met in the presence of large process variations. Characterizing an entire processor and its worst-case delay path under Monte Carlo is usually out of the question. In [7], a Monte Carlo approximation is proposed to ensure correct timing under NTV variations. The results of these previous timing characterization methods, however, may still differ considerably from the actual silicon under test. For example, the 0.6V DSP in [7] was pre-silicon characterized at 14MHz after variations and aging margins, but after fabrication, was able to successfully operate at 43.4MHz. With continued process scaling, guaranteeing post-fabricated operation that correlates well with pre-silicon simulation is becoming more challenging, especially for NTV operation.

3.2 Error Detection

A different solution to combating variations is in-situ error detection. After digital circuits are synthesized, an error detection method is employed that detects if the timing constraints are met during normal chip operation. Therefore, the design is not required to undergo simulation of every possible variation-induced timing error that might occur under NTV.

The most common error detection method is Razor [6], or related timing speculative approaches [2]. With Razor, each datapath flip-flop is modified similar to the circuit shown in Figure 1. Each flip-flop is compared with the value stored in a 'shadow' latch that is sampled by a delayed clock, typically around 20% after the main clock edge. If the outputs have not settled to their final result before the main clock edge, they will be caught in this delayed latch afterward. After XOR comparison, if the two values are different, an error is flagged.

Figure 1: Razor flip-flops with global error flag
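The detection rule described above can be sketched behaviorally. This is a simplified model, not the Razor circuit itself: it assumes a single data transition per cycle at time `arrival`, a shadow latch that samples at `period * (1 + shadow_delay)`, and hypothetical function names.

```python
def razor_capture(arrival, new_bit, old_bit, period, shadow_delay=0.2):
    """Model one Razor flip-flop for one cycle.

    The data input holds old_bit until `arrival`, then new_bit.
    Returns (main, shadow, error): the value captured at the main clock
    edge, the value captured by the delayed shadow latch, and their XOR.
    """
    def data_at(t):
        return new_bit if arrival <= t else old_bit

    main = data_at(period)                            # main clock edge
    shadow = data_at(period * (1.0 + shadow_delay))   # delayed clock, ~20% later
    error = bool(main ^ shadow)                       # XOR comparison
    return main, shadow, error


def global_error(flags):
    """OR together the per-flip-flop error flags into one global flag."""
    return any(flags)
```

For example, with `period=1.0`, a transition arriving at 0.9 is captured correctly with no flag, one arriving at 1.1 is flagged, and one arriving at 1.3 lands outside the 20% window and goes undetected in this model. That last case is why the window size matters when delays spread widely.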

These errors are then managed by an error recovery mechanism to ensure that all instructions are executed correctly after an error. There are many different recovery methods, with different recovery speeds and energy overheads [3]. The clearest advantage of Razor is the potential for approximately a 20% speed improvement. Using this method allows for operation faster than the worst-case STA, improving both throughput and energy.

However, Razor circuits do exhibit several disadvantages in the NTV regime. First, the addition of an extra set of latches to the datapath increases energy, significantly impacting any energy improvements due to increased clock speed. Second, in order to guarantee that the 20% window can correctly catch all errors, minimum delay buffers must be added to all min-delay fast paths in the logic, in order to eliminate race conditions that prevent Razor from operating correctly. These buffers add to the energy overhead of Razor as well. Hence, while it is common to apply Razor and min-delay path insertion into only a subset of the logic paths at super-threshold, at near-threshold, this may not guarantee correct operation due to the unknown delay distribution. The final disadvantage with Razor, specifically in the NTV regime, is its inability to improve throughput beyond approximately 20%, limited by the min-path race conditions (footnote 1).

[Plots: histograms of # of occurrences vs. adder delay (ns) from 5 to 35 ns. Panel (a) marks the chosen Monte Carlo case; panel (b) marks worst-case STA, Razor, TACD, and CSCD.]

Figure 2: Histograms of (a) Monte Carlo chip-to-chip delay of the STA and (b) delay of changing FIR filter data on a 16-bit adder with error-detection speeds marked.

4. VARIATION STUDY

In order to explore the effects of process variations in the NTV regime, a 16-bit Carry-Lookahead (CLA) adder was synthesized in a 90nm IBM CMOS process. Figure 2a shows the histogram of a 500-point Monte-Carlo simulation of the CLA adder, where the inputs are the worst-case static timing analysis (STA) vectors determined by the synthesizer. The figure shows a large standard deviation in delay while operating at a NTV voltage of 500mV. In addition to process variation-induced timing uncertainty, there is input vector variation. Figure 2b shows the simulated delays of the same adder for one particular Monte-Carlo simulation, where the input vectors are supplied from the outputs of a FIR filter. Further, these vector-to-vector timing variations worsen as the circuits are driven farther into the NTV regime.

Figure 3 shows the potential speedup that can be achieved with the ability to ideally detect errors. For this simulation, we chose the worst-performing Monte Carlo case from the 500 that were simulated, resulting in a worst-case clock speed of 32MHz (as opposed to a best-case speed of 166MHz). Next, 1000 add-vectors from a low-pass FIR filter using electroencephalography (EEG) data were extracted from a Matlab simulation, and then simulated with the 16-bit carry-

1. The theoretical maximum delay window for Razor is 50%. Unfortunately, NTV operation may exhibit delays that span much more than 50%.

500-point Monte-Carlo

Worst-Case STA

Single Worst Monte Carlo

Random Input Vectors

Razor Timing Speculation [Ernst 04]

•  Popular mechanism to ‘speculate’ on logic completion
•  Detects timing violations (typ: ~20%)
•  Requires extra logic
   –  Inverter buffering of min-delay paths

Conventional Razor Scalar Processors

•  Architecture:
   –  Typically optimized for general-purpose control
      •  Single instruction/cycle
      •  Deep pipelining for ILP
      •  Global ERROR signal

•  Razor Circuits:
   –  Limited timing speculation
      •  Min-path race can be erroneous
      •  Avg. speculation ~ 20-25%
      •  Max. speculation = 50%

ARM-Michigan, ISSCC-2010

J. Low Power Electron. Appl. 2011, 1 342

3.1. Clock Gating

Clock Gating is conceptually the simplest of all error recovery methods to implement. Its original purpose was to save power at the system level by not clocking blocks while they are unused. This technique can also be adapted to error recovery by pausing all pipeline stages while waiting for the slow stage to either finish computation or to allow for the instruction to be re-executed. The pausing action ensures that later instructions do not continue to their next pipeline stage until the errant instruction is corrected. It is most commonly paired with Razor flip-flops, as it only works if the pipeline can be stalled before the next clock edge, before the pipeline registers are set to get new data, which can be achieved in slow systems. The Clock Gating concept is illustrated in Figure 7.

Figure 7. (a) Pipeline modification for Clock Gating error recovery method; (b) Clock Gating pipeline data path with errors.

[Figure 7(a): pipeline stages IF, ID, EX, MEM, and WB separated by flip-flops (FF), with clk, per-stage error signals, ST, and PC labeled.]

[Figure 7(b): cycle-by-cycle occupancy of rows IF, ID, EX, MEM, ST, and WB over cycles 1 to 25, with ST markers overlaid and three annotated events: error in EX of Inst. 4, error in ID of Inst. 11, and error in 2 stages at once. By cycle 25, IF has reached I22 and WB has reached I17.]

The primary advantage of this method is that it requires very few architectural changes and minimal area addition compared with other methods. However, in order for this method to work properly, a stall signal needs to propagate to all pipeline stages in a very short amount of time (50% of one clock cycle when Razor circuits are used). This can be difficult to achieve across large CMOS dies where pipeline stages are several millimeters apart. Furthermore, this is completely impractical to implement in complicated microprocessors because it may take several clock cycles just to propagate the clock signal through a clock distribution network, which cannot be halted in only one cycle.
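The throughput cost of global clock gating can be sketched with a small cycle-level simulation. It assumes each error event gates the clock of every stage for exactly one cycle, and that errors in several stages during the same cycle share that one gated cycle. The function name and the gated cycle numbers used below are hypothetical.

```python
def run_clock_gated_pipeline(n_instructions, n_stages, gated_cycles):
    """Return the cycle in which the last instruction occupies the final stage.

    gated_cycles: set of 1-based cycle numbers in which the clock is gated,
    so that no pipeline register captures new data in that cycle.
    """
    pipeline = [None] * n_stages    # pipeline[0] is IF, pipeline[-1] is WB
    next_inst = 1
    retired = 0
    cycle = 0
    while retired < n_instructions:
        cycle += 1
        if cycle in gated_cycles:
            continue                # stall: every stage holds its contents
        incoming = next_inst if next_inst <= n_instructions else None
        next_inst += 1
        pipeline = [incoming] + pipeline[:-1]   # all stages advance together
        if pipeline[-1] is not None:
            retired += 1
    return cycle
```

With 17 instructions, 6 stages, and three gated cycles, for instance `{5, 13, 20}`, the function returns 25, compared with 22 when no cycle is gated. That appears consistent with the Figure 7(b) residue, where three error events occur and WB reaches I17 at cycle 25.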

3.2. Counterflow Pipelining

Traditional Counterflow Pipelining is a microarchitecture technique that uses a bidirectional pipeline, allowing instructions to flow forward and results to flow backward. This technique made it easier to implement operand forwarding, register renaming, and most importantly pipeline flushing [13,17]. In order

VDD 0.9-1.1V fCLK = 1GHz in 65nm

Data-Parallel SIMD Architectures

•  SIMD: single-instruction, multiple-data
•  Multiple execution units share sequencer and control path
•  Sharing control path amortizes its energy overhead among lanes
•  GPU is a popular example of SIMD

Timing Errors in SIMD Architecture

STALL!

Error Impact on SIMD Architecture

[Plot: fraction of peak throughput (0.5 to 1) vs. probability of an error in a single stage, single lane (0 to 0.1), for SISD, 16-wide SIMD, and 32-wide SIMD.]
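A simple probabilistic model shows why wide lockstep SIMD is hit harder by timing errors. It assumes every lane and stage errs independently with probability `p` per cycle and that any error stalls all lanes for `stall_penalty` cycles. This is an illustrative assumption, not necessarily the model behind the plotted curves.

```python
def prob_any_error(p, lanes, stages=1):
    """Probability that at least one lane/stage flags an error in a cycle."""
    return 1.0 - (1.0 - p) ** (lanes * stages)


def simd_throughput_fraction(p, lanes, stages=1, stall_penalty=1):
    """Fraction of peak throughput when any error stalls every lane.

    Each useful cycle costs, on average, an extra
    stall_penalty * P(any error) cycles.
    """
    return 1.0 / (1.0 + stall_penalty * prob_any_error(p, lanes, stages))


if __name__ == "__main__":
    for lanes in (1, 16, 32):       # SISD, 16-wide SIMD, 32-wide SIMD
        print(lanes, round(simd_throughput_fraction(0.1, lanes), 3))
```

In this model the penalty compounds with width: one lane loses under 10% of peak at `p = 0.1`, while a 32-wide machine stalls on almost every cycle and approaches half of peak.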

Proposed 10-lane SIMD Architecture

[Figure: Synctium pipeline. An instruction queue indexed 0 to 127 feeds 10 lanes (Lane 0 to Lane 9). Each lane has a decoupling queue (DQ) with Stall, Next Inst, Full, Bar, and Pass signals, 3:1 muxes selecting among neighboring IQ_out, RF_out, and ALU_out values, a register file (RF), an ALU with ALU_wb write-back, and Razor registers. IQ control receives Stall [9:0] and Bar [9:0]. Labeled stages: Instruction Fetch, Decoupling Queues, IQ Weave / RF Access, Execute, and ALU/RF Weave / WB, each annotated with a latency of 1 or 2 cycles. Example queue contents include lw (barrier), umul $r1, $r2, xor $r7, $r4, msub $r2, $r3, and add $r9, $r7.]

Decoupled Parallel SIMD Pipeline

STALL!

Decoupling synchronous ‘async’ between lanes!

Decoupled Parallel SIMD Pipeline (DPSP)

[Plot: fraction of peak throughput (0.5 to 1) vs. probability of an error in a single stage, single lane (0 to 0.1), for SISD, 32-wide SIMD, and 32-wide DPSP.]
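The benefit of decoupling can be sketched by comparing two stall policies. The model assumes independent per-lane error probability `p`, a one-cycle stall per error, and decoupling queues that never fill. Barrier synchronization and queue back-pressure are ignored, so this is an optimistic upper bound under stated assumptions rather than the measured DPSP behavior.

```python
def lockstep_fraction(p, lanes):
    """Lockstep SIMD: an error in any lane stalls all lanes for one cycle."""
    p_any = 1.0 - (1.0 - p) ** lanes
    return 1.0 / (1.0 + p_any)


def decoupled_fraction(p):
    """Decoupled lanes: each lane stalls only on its own errors.

    Assumption: the decoupling queues absorb the skew between lanes,
    so the result does not depend on the number of lanes.
    """
    return 1.0 / (1.0 + p)
```

Under these assumptions a 32-wide decoupled machine tracks the single-lane (SISD) curve instead of the 32-wide lockstep curve, which is the qualitative gap the plot illustrates.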

Delay Characterization

[Plot: delay (ns) vs. VDD (V) at 0.53, 0.6, 0.7, 0.8, and 1.0 V. Inset: Variation vs. Sample Size (Lane 1, 0.53V), delay (ns) for 10-point and 100-point samples.]

Lane Weaving - Eliminating lanes improves total throughput

[Plot: maximum operating frequency (MHz, 50 to 500) for lanes 0 to 9 at VDD = 0.53V, 0.6V, 0.7V, 0.8V, and 1.0V, annotated with frequency Δ values of 35%, 6%, 2%, 9%, and 8%.]

[Plot: throughput (GOPS, 0 to 5) vs. VDD (V) at 0.53, 0.60, 0.70, 0.80, and 1.0 for three configurations: 10 lanes with 0 spares, 9 lanes with 1 spare, and 8 lanes with 2 spares. Bar values in extraction order: 4.35, 3.56, 3.96; 3.25, 2.76, 2.99; 1.55, 1.36, 1.46; 0.85, 0.92, 0.86; 2.53, 2.18, 2.32.]
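Why dropping a lane can raise total throughput can be sketched as follows. The model assumes all active lanes share one clock set by the slowest active lane, that each lane completes one operation per cycle, and that spare lanes are simply the slowest ones. The lane frequencies in the usage note are hypothetical, not the measured values from the plots.

```python
def lockstep_throughput(lane_fmax_mhz, spares):
    """Throughput in M-ops/s when the `spares` slowest lanes are disabled."""
    if not 0 <= spares < len(lane_fmax_mhz):
        raise ValueError("spares must leave at least one active lane")
    ranked = sorted(lane_fmax_mhz, reverse=True)
    active = ranked[:len(ranked) - spares]
    return len(active) * min(active)    # lanes * shared clock frequency


def best_spare_count(lane_fmax_mhz, max_spares=2):
    """Pick the number of disabled lanes that maximizes throughput."""
    return max(range(max_spares + 1),
               key=lambda s: lockstep_throughput(lane_fmax_mhz, s))
```

With nine lanes at 100 MHz and one at 65 MHz (a 35% gap), disabling the slow lane gives 900 M-ops/s versus 650 with all ten active, so one spare wins. With the slow lane at 98 MHz instead, keeping all ten lanes is best. This mirrors the slide's point that sparing pays off mainly where lane-to-lane variation is large.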

[Plot: peak performance [OP†/sec] (10^3 to 10^12) vs. power [W] (10^-9 to 10^2), with three regions: sub-threshold low-performance, near-threshold medium-performance, and super-threshold high-performance. Labeled points: Imagine, SPI Storm 1, Subliminal, Phoenix, SODA (90nm), AnySP‡, TILE64, Intel i5-67, ARM Cortex A9, ATI HD 5970, Scale VT, Intel SIMD accelerator, TI LP DSP, and points at VDD=1V, VDD=0.6V, and VDD=0.53V.]

1st Near-VT Parallel-SIMD Processor with Variation Resiliency
