Power Reduction Techniques
• Stop the clock – Dynamic power reduction
• Power gating – Reduce the leakage
• How fast can you turn something on/off? – Nothing to do → sleep
• How can you save power while in operation? – Near-threshold design
What is the overhead of shutting down? Versus just letting it leak?
Power Gating
Kevin Nowka, IBM
Gate Leakage
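Digital Parallelization: Y[n] = X[n] + αX[n-1]
[Figure: an analog input signal is digitized (5 bits @ 5 GS/s, or 8 bits @ 100 MHz) and crosses from the ANALOG to the DIGITAL domain. Registers clocked at Clk = 5 GHz hold X[n] and X[n-1]; a multiplier scales by α and an adder forms Y[n].]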
DSP Parallelization: Y[n] = X[n] + αX[n-1]
[Figure: two-way parallel datapath. The input (5 bits @ 5 GS/s) arrives at CLK = 5 GHz and is split onto clk and clkb phases; two multiply/add branches at CLK = 2.5 GHz compute Y[n] = X[n] + αX[n-1] and Y[n-1] = X[n-1] + αX[n-2].]
DSP Parallelization
• Clock speed reduced by ½
  – Can parallelize further
  – Increase number of MACs (multiply/accumulates) by 2
• Intuition?
  – Area goes up by 2
  – Power decreases (clock rate down by 2, computations up by 2, but easier timing constraints)
  – What about clock power?
• Save a little power, but double the area?
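The two-way parallelization above can be sketched in software. This is a minimal behavioral model, not the slide's circuit: the function names `fir_serial` and `fir_two_way` are hypothetical, X[-1] is assumed to be 0, and the input length is assumed to be even. Each loop iteration of the parallel version stands for one half-rate clock cycle that uses two multiply/accumulate units.

```python
def fir_serial(x, alpha):
    """Reference filter: Y[n] = X[n] + alpha * X[n-1], with X[-1] taken as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample + alpha * prev)
        prev = sample
    return y


def fir_two_way(x, alpha):
    """Two-way parallel version: each half-rate cycle consumes two samples
    and produces two outputs using two multiply/accumulate units."""
    if len(x) % 2:
        raise ValueError("this sketch assumes an even number of samples")
    y = []
    x_nm2 = 0.0  # X[n-2], the value carried between half-rate cycles
    for k in range(0, len(x), 2):
        x_nm1, x_n = x[k], x[k + 1]
        y.append(x_nm1 + alpha * x_nm2)  # Y[n-1] = X[n-1] + alpha * X[n-2]
        y.append(x_n + alpha * x_nm1)    # Y[n]   = X[n]   + alpha * X[n-1]
        x_nm2 = x_n
    return y
```

Both functions return the same sequence; the parallel one simply does twice the work per iteration, which is the area-for-clock-rate trade the bullets describe.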
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation
• http://www.eecs.umich.edu/~taustin/papers/MICRO36-Razor.pdf
[Figure: two plots versus Supply Voltage (V), 0 to 1 V. (1) "Delay vs. Vdd": quantizer delay (s), log scale from 10^-10 to 10^-4. (2) "Energy/Conv-step vs. Vdd": energy (J), log scale from 10^-16 to 10^-13, with Total, Dynamic, and Static curves.]
Low-Speed Large Circuit Parallelism
1:N Multiplexing High Energy-Efficiency
Near-VT Operation
Near-VT Design Margins
• Significant energy/performance lost to guardbands
[Figure: stacked delay budgets for "Today" and "Future" along a time axis. Each stack lists, from top to bottom: maximum logic delay, noise guard-band, wearout guard-band, process variation guard-band, clock skew and jitter, and typical logic delay. The span above typical logic delay is labeled "Guardbands".]
Near-VT: The Variability Challenge
Motivating Example: Medical Electronics. Power consumption critically constrains medical devices, as the battery typically limits the system cost, safety, and lifetime [12]. For example, my collaborator Professor B. Chi at Tsinghua University (China) and I designed a swallowable wireless endoscope capsule [10,11]. Each endoscope System-on-a-Chip (SoC) incorporates a large-arrayed CMOS image sensor, an on-die digital signal processor, and a low-data-rate RF wireless transceiver. Unfortunately, the RF transceiver dissipates more than 90% of the total power budget. Because image sensing algorithms are highly parallel, high throughput computing performed locally within the endoscope itself would significantly reduce the transmitted RF downlink bandwidth and hence significantly improve battery life. Alternatively, it would enable us to partition more power to the 2D-data acquisition in the image-sensing analog front-end.
In a more exotic, energy-constrained medical application, next-generation brain implants exploit arrays of hundreds of electrodes [20, 21] to analyze nerve potentials, enabling new scientific understanding and possible neurological treatments. Limits on battery capacity and heat generation [21,57] restrict brain implants to low power levels of several mW, but the implants still require significant computational capabilities to reduce the wireless transmission power [22, 23]. Recent work [56] suggests that with 1024 neural channels, local on-chip DSP algorithms such as feature extraction and clustering can improve total system power by factors of 4.3 and 26, respectively. In addition to the large amount of local DSP processing, future medical devices will also require large computational throughput for performing data encryption, compression, and wireless baseband processing.
[Figure: plots of delay [s] versus Vdd [mV] from 400 to 650 mV, with min, max, avg, and std statistics.]
[Figure: energy/inst (pJ) and frequency (kHz) versus Vdd (V) from 0.20 to 0.50 V. Annotated point: Vdd = 350 mV, 3.52 pJ/inst, 354 kHz.]
Figure 3: Subliminal processor frequency and energy breakdowns at various supply voltages.
increase in a near-exponential fashion. This rise in leakage energy eventually dominates any reduction in switching energy, creating an energy minimum seen in Figure 2.
The identification of an energy minimum has led to interest in processors that operate at this energy-optimal supply voltage [12,14,17] (referred to as Vmin and typically 250mV-350mV). However, the energy minimum is relatively shallow. Energy typically reduces by only ~2X when Vdd is scaled from the near-threshold regime (400-500mV) to the subthreshold regime, though delay rises by 50-100X over the same region. While acceptable in ultra-low energy sensor-based systems, this delay penalty is not tolerable in a broad set of applications. Hence, although introduced roughly 30 years ago, ultra-low voltage design remains confined to a small set of markets with little or no impact on mainstream semiconductor products.
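The origin of the energy minimum can be illustrated with a toy model. This is only a sketch under stated assumptions, not the analysis from the cited work: switching energy is taken as proportional to Vdd², and leakage energy per operation as Vdd² times an exponential term that stands in for the exponentially growing cycle time at low voltage. The parameters `leak_ratio` and `n_vt` (subthreshold slope factor times thermal voltage, in volts) are hypothetical values chosen so the minimum lands in the 250mV-350mV range mentioned above, and the model ignores the transition to super-threshold behavior.

```python
import math


def energy_per_op(vdd, leak_ratio=1000.0, n_vt=0.039):
    """Return (dynamic, leakage) energy per operation in arbitrary units.

    dynamic ~ Vdd^2; leakage ~ Vdd^2 * exp(-Vdd / n_vt), an assumed
    subthreshold-style model in which a longer cycle leaks for longer.
    """
    dynamic = vdd ** 2
    leakage = leak_ratio * vdd ** 2 * math.exp(-vdd / n_vt)
    return dynamic, leakage


def find_vmin(v_lo=0.15, v_hi=1.0, step=0.005):
    """Grid-search the supply voltage that minimizes total energy per operation."""
    count = int(round((v_hi - v_lo) / step)) + 1
    grid = [v_lo + i * step for i in range(count)]
    return min(grid, key=lambda v: sum(energy_per_op(v)))
```

With these assumed parameters the search returns a voltage a little above 0.3 V; below it the leakage term dominates, and above it the switching term does.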
3. NTC Analysis: Recent work at many leading institutions has produced working processors that operate at subthreshold voltages. For instance, the Subliminal processor [17] designed by Hanson et al. provides the opportunity to clearly quantify the NTC region and how it compares to the subthreshold region. Figure 3 presents the energy breakdown of the design as well as the operating frequency achieved across a range of voltages. As was discussed in Section 2, there is a Vmin operating point that occurs in the subthreshold operating region but is tied to operating points of less than 1MHz. On the other hand, only a modest increase in energy is seen operating at the NTC region (around 0.5V), while frequency characteristics at that point are significantly better. At nominal operating points Subliminal operates at 20.5 MHz and 33.1 pJ/inst, showing approximately a 6.6x reduction in energy and an 11.4x reduction in frequency at the NTC operating point.
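As a rough cross-check of the figures quoted above, the NTC operating point can be recovered from the nominal numbers and the stated ratios. The values below are derived arithmetic rather than numbers reported in the source, and the comparison against the minimum-energy point assumes the 350 mV annotation from Figure 3 (3.52 pJ/inst, 354 kHz):

```latex
\begin{align*}
E_{\mathrm{NTC}} &\approx \frac{33.1\ \mathrm{pJ/inst}}{6.6} \approx 5.0\ \mathrm{pJ/inst}, &
f_{\mathrm{NTC}} &\approx \frac{20.5\ \mathrm{MHz}}{11.4} \approx 1.8\ \mathrm{MHz}, \\
\frac{E_{\mathrm{NTC}}}{E_{V_{\min}}} &\approx \frac{5.0}{3.52} \approx 1.4, &
\frac{f_{\mathrm{NTC}}}{f_{V_{\min}}} &\approx \frac{1.8\ \mathrm{MHz}}{0.354\ \mathrm{MHz}} \approx 5.1.
\end{align*}
```

Under these assumptions, moving from Vmin up to the NTC point costs roughly 1.4x in energy per instruction and buys roughly 5x in frequency.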
4. NTC Barriers: Although NTC provides for excellent energy-frequency tradeoffs, it doesn’t come without its own set of complications. NTC faces three key barriers that must be overcome for widespread use: performance loss, performance variation, and functional failure. In the following subsections we will discuss why each of these exists and why they pose problems to the widespread adoption of NTC.
4.1. Performance loss. The performance loss observed in NTC, while not as severe as that in subthreshold operation, poses one of the most formidable challenges for NTC viability. In an industrial 45nm technology the fanout-of-four inverter (FO4) delay at 400mV is 10X slower than at the nominal 1.1V. There have been several recent advances in architectural and circuit techniques that can be used to improve performance in the NTC regime. These techniques, described in detail in Section 5.1, center around aggressive parallelism with a novel NTC-oriented memory/computation hierarchy. The increased communication needs in these architectures are supported by the application of 3D chip integration, as made feasible by the low power density of NTC circuits. In addition a new
Figure 2: Energy and delay in different supply voltage operating regions.
Figure 1: (a) Effect of lowered Vdd on the energy consumed/computation [13] and logic delay in 0.13um-CMOS; (b) Box plot of the delay uncertainty across Monte Carlo variations while scaling supply voltage, and for varying input values at Vdd=0.5V in 90nm-CMOS
1.2 Insight: Near-threshold operation saves energy, but has reliability limitations
In all of these examples, traditional low-power techniques cannot achieve the necessary efficiency while still achieving the required performance. The microelectronics community has previously shown that sub- or near-threshold circuits achieve optimal energy and power performance where the dynamic energy and leakage energy dissipated per cycle are balanced [31]. This approach can improve energy efficiency by 10x to 20x, but only while running at very low frequencies of 0.1-10 MHz, depending on the complexity of the circuit (Figure 1a). Furthermore, near-threshold designs suffer from very high process variability that can severely degrade performance and efficiency by
16b-CLA, 90nm-CMOS, 0.5V
In the NTV regime, it is extremely difficult to ensure that these timing constraints will be met in the presence of large process variations. Characterizing an entire processor and its worst-case delay path under Monte Carlo is usually out of the question. In [7], a Monte Carlo approximation is proposed to ensure correct timing under NTV variations. These previous timing characterization methods, however, may still differ considerably from the actual silicon under test. For example, the 0.6V DSP in [7] was pre-silicon characterized at 14MHz after variations and aging margins, but after fabrication, was able to successfully operate at 43.4MHz. With continued process scaling, guaranteeing post-fabricated operation that correlates well with pre-silicon simulation is becoming more challenging, especially for NTV operation.
3.2 Error Detection
A different solution to combating variations is in-situ error detection. After digital circuits are synthesized, an error detection method is employed that detects if the timing constraints are met during normal chip operation. Therefore, the design is not required to undergo simulation of every possible variation-induced timing error that might occur under NTV.
The most common error detection method is Razor [6], or related timing speculative approaches [2]. With Razor, each datapath flip-flop is modified similar to the circuit shown in Figure 1. Each flip-flop is compared with the value stored in a 'shadow' latch that is sampled by a delayed clock, typically around 20% of a clock period after the main clock edge. If the outputs have not settled to their final result before the main clock edge, they will be caught in this delayed latch afterward. After XOR comparison, if the two values are different, an error is flagged.
Figure 1: Razor flip-flops with global error flag
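The detection rule described above can be sketched as a small behavioral model. This is an illustration under assumptions, not the Razor circuit itself: the function name `razor_sample` and its parameters are hypothetical, time is normalized to one clock period, and a signal that has not settled is modeled as still showing its stale (previous) value.

```python
def razor_sample(settle_time, final_value, stale_value,
                 clock_period=1.0, shadow_delay_fraction=0.2):
    """Model one Razor flip-flop for one cycle.

    settle_time: when the logic output reaches final_value, measured from
                 launch in the same units as clock_period.
    Returns (main_value, shadow_value, error_flag).
    """
    shadow_edge = clock_period * (1.0 + shadow_delay_fraction)
    # The main flip-flop samples at the clock edge.
    main_value = final_value if settle_time <= clock_period else stale_value
    # The shadow latch samples at the delayed clock edge.
    shadow_value = final_value if settle_time <= shadow_edge else stale_value
    # XOR comparison: a mismatch flags a timing error.
    error_flag = main_value != shadow_value
    return main_value, shadow_value, error_flag
```

A path that settles at 0.9 of a period raises no flag, one that settles at 1.1 is caught by the shadow latch, and one that settles at 1.3 is missed by both samplers, which is why the size of the detection window matters.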
These errors are then managed by an error recovery mechanism to ensure that all instructions are executed correctly after an error. There are many different recovery methods, with different recovery speeds and energy overheads [3]. The clearest advantage of Razor is the potential for approximately a 20% speed improvement. Using this method allows for operation faster than the worst-case STA, improving both throughput and energy.
However, Razor circuits do exhibit several disadvantages in the NTV regime. First, the addition of an extra set of latches to the datapath increases energy, significantly impacting any energy improvements due to increased clock speed. Second, in order to guarantee that the 20% window can correctly catch all errors, minimum delay buffers must be added to all min-delay fast paths in the logic, in order to eliminate race conditions that prevent Razor from operating correctly. These buffers add to the energy overhead of Razor as well. Hence, while it is common to apply Razor and min-delay path insertion into only a subset of the logic paths at super-threshold, at near-threshold, this may not guarantee correct operation due to the unknown delay distribution. The final disadvantage with Razor, specifically in the NTV regime, is its inability to improve throughput beyond approximately 20%, limited by the min-path race conditions.¹
[Figure 2 panels: (a) histogram of adder delay (ns) versus number of occurrences, with the chosen Monte Carlo case marked; (b) histogram of adder delay (ns) versus number of occurrences, with Worst-case STA, Razor, TACD, and CSCD speeds marked.]
Figure 2: Histograms of (a) Monte Carlo chip-to-chip delay of the STA and (b) delay of changing FIR filter data on a 16-bit adder with error-detection speeds marked.
4. VARIATION STUDY
In order to explore the effects of process variations in the NTV regime, a 16-bit Carry-Lookahead (CLA) adder was synthesized in a 90nm IBM CMOS process. Figure 2a shows the histogram of a 500-point Monte-Carlo simulation of the CLA adder, where the inputs are the worst-case static timing analysis (STA) vectors determined by the synthesizer. The figure shows a large standard deviation in delay while operating at an NTV voltage of 500mV. In addition to process variation-induced timing uncertainty, there is input vector variation. Figure 2b shows the simulated delays of the same adder for one particular Monte-Carlo simulation, where the input vectors are supplied from the outputs of a FIR filter. Further, these vector-to-vector timing variations worsen as the circuits are driven farther into the NTV regime.
Figure 3 shows the potential speedup that can be achieved with the ability to ideally detect errors. For this simulation, we chose the worst-performing Monte Carlo case from the 500 that were simulated, resulting in a worst-case clock speed of 32MHz (as opposed to a best-case speed of 166MHz). Next, 1000 add-vectors from a low-pass FIR filter using electroencephalography (EEG) data were extracted from a Matlab simulation, and then simulated with the 16-bit carry-
¹ The theoretical maximum delay window for Razor is 50%. Unfortunately, NTV operation may exhibit delays that span much larger than 50%.
500-point Monte-Carlo
Worst-Case STA
Single Worst Monte Carlo
Random Input Vectors
Razor Timing Speculation [Ernst 04]
• Popular mechanism to ‘speculate’ on logic completion
• Detects timing violations (typ: ~20%)
• Requires extra logic
  – Inverter buffering of min-delay paths
Conventional Razor Scalar Processors
• Architecture:
  – Typically optimized for general-purpose control
    • Single instruction/cycle
    • Deep pipelining for ILP
    • Global ERROR signal
• Razor Circuits:
  – Limited timing speculation
    • Min-path race can be erroneous
    • Avg. speculation ~ 20-25%
    • Max. speculation = 50%
ARM-Michigan, ISSCC-2010
J. Low Power Electron. Appl. 2011, 1 342
3.1. Clock Gating
Clock Gating is conceptually the simplest technique to implement of all error recovery methods. Its original purpose was for saving power on unused blocks on a systems level by not clocking them when they are not used. This technique can also be adapted to error recovery by pausing all pipeline stages while waiting for the slow stage to either finish computation or to allow for the instruction to be re-executed. The pausing action ensures that later instructions do not continue to their next pipeline stage until the errant instruction is corrected. It is most commonly paired with Razor flip-flops as it only works if the pipeline can be stalled before the next clock edge, before the pipeline registers are set to get new data, which can be achieved in slow systems. The Clock Gating concept is illustrated in Figure 7.
Figure 7. (a) Pipeline modification for Clock Gating error recovery method; (b) Clock Gating pipeline data path with errors.
[Figure 7(a): pipeline stages IF, ID, EX, MEM, ST, and WB separated by FF pipeline registers; labels for clk, PC, and per-stage error signals.]
(a)
Cycle 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
IF    I1  I2  I3  I4  I5  I6  ST  I7  I8  I9  I10 I11 I12 ST  I13 I14 I15 I16 I17 I18 I19 ST  I20 I21 I22
ID    --  I1  I2  I3  I4  I5  ST  I6  I7  I8  I9  I10 I11 I11 I12 I13 I14 I15 I16 I17 I18 ST  I19 I20 I21
EX    --  --  I1  I2  I3  I4  I4  I5  I6  I7  I8  I9  I10 ST  I11 I12 I13 I14 I15 I16 I17 I17 I18 I19 I20
MEM   --  --  --  I1  I2  I3  ST  I4  I5  I6  I7  I8  I9  ST  I10 I11 I12 I13 I14 I15 I16 I16 I17 I18 I19
ST    --  --  --  --  I1  I2  ST  I3  I4  I5  I6  I7  I8  ST  I9  I10 I11 I12 I13 I14 I15 ST  I16 I17 I18
WB    --  --  --  --  --  I1  ST  I2  I3  I4  I5  I6  I7  ST  I8  I9  I10 I11 I12 I13 I14 ST  I15 I16 I17
Annotated events: Error in EX of Inst. 4; Error in ID of Inst. 11; Error in 2 stages at once. An ST cell marks a stalled slot; a repeated instruction marks the stage that re-executes.
(b)
The primary advantage to this method is that it requires very few architectural changes as well as minimal area addition to a design compared with other methods. However, in order for this method to work properly, a stall signal needs to propagate to all pipeline stages in a very short amount of time (50% of one clock cycle when Razor circuits are used). This can be difficult to achieve across large CMOS dies where pipeline stages are several millimeters apart. Furthermore, this is completely impractical to implement in complicated microprocessors because it may take several clock cycles just to propagate the clock signal through a clock distribution network which cannot be halted in only one cycle.
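The stall behavior of clock-gating recovery can be sketched with a cycle-level model. This is a simplified illustration, not the implementation from the paper: the function name `run_clock_gated_pipeline` is hypothetical, each listed error is assumed to occur once and to cost exactly one stall cycle, and errors that hit several stages in the same cycle are assumed to share a single stall.

```python
STAGES = ["IF", "ID", "EX", "MEM", "ST", "WB"]


def run_clock_gated_pipeline(num_instructions, errors):
    """Simulate an in-order pipeline whose clock is gated for one cycle on error.

    errors: iterable of (stage_name, instruction_number) pairs that miss
            timing once. Returns one dict per cycle mapping each stage to
            the instruction number it holds (or None when empty).
    """
    pending = set(errors)
    occupancy = [None] * len(STAGES)
    next_inst = 1
    retired = 0
    stalled = False
    timeline = []
    while retired < num_instructions:
        if not stalled:
            # Normal cycle: every stage hands its instruction to the next one.
            fetched = next_inst if next_inst <= num_instructions else None
            if fetched is not None:
                next_inst += 1
            occupancy = [fetched] + occupancy[:-1]
        timeline.append(dict(zip(STAGES, occupancy)))
        # Any stage reporting an error pauses the whole pipeline next cycle.
        hits = {(s, i) for s, i in zip(STAGES, occupancy) if (s, i) in pending}
        pending -= hits
        stalled = bool(hits)
        if not stalled and occupancy[-1] is not None:
            retired += 1
    return timeline
```

For example, `run_clock_gated_pipeline(22, [("EX", 4), ("ID", 11), ("EX", 17), ("MEM", 16)])` takes three cycles longer than the error-free run, because the last two errors fall in the same cycle and share one stall.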
3.2. Counterflow Pipelining
Traditional Counterflow Pipelining is a microarchitecture technique that uses a bidirectional pipeline, allowing instructions to flow forward and results to flow backward. This technique made it easier to implement operand forwarding, register renaming, and most importantly pipeline flushing [13,17]. In order
VDD 0.9-1.1V fCLK = 1GHz in 65nm
Data-Parallel SIMD Architectures
• SIMD: single-instruction, multiple-data
• Multiple execution units share sequencer and control path
• Sharing control path amortizes its energy overhead among lanes
• GPU is a popular example of SIMD
Timing Errors in SIMD Architecture
STALL!
Error Impact on SIMD Architecture
[Figure: fraction of peak throughput (0.5 to 1) versus probability of an error in a single stage, single lane (0 to 0.1), for SISD, 16-wide SIMD, and 32-wide SIMD.]
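One simple way to reason about the trend in this plot is a lockstep stall model. The formula below is an assumption chosen for illustration, not the model behind the slide: every stage of every lane is taken to fail independently with the same probability each cycle, and any failure is taken to cost all lanes a fixed number of stall cycles. The name `lockstep_throughput_fraction` and its parameters are hypothetical.

```python
def lockstep_throughput_fraction(p_error, lanes, stages=1, penalty_cycles=1):
    """Fraction of peak throughput for lanes that must all stall together.

    p_error: probability of a timing error in a single stage of a single lane.
    A cycle is clean only if every stage in every lane is clean; otherwise
    all lanes lose penalty_cycles cycles.
    """
    p_clean = (1.0 - p_error) ** (lanes * stages)
    p_stall = 1.0 - p_clean
    return 1.0 / (1.0 + penalty_cycles * p_stall)
```

Under these assumptions a single lane (SISD) loses little throughput at small error rates, while a wide SIMD array approaches one half of peak, because almost every cycle contains an error somewhere.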
Proposed 10-lane SIMD Architecture
[Figure: Synctium pipeline block diagram. An instruction queue (entries 0 to 127) feeds ten lanes (Lane 0 to Lane 9) through IQ control, with Stall IQ, Next Inst, DQ Full, Pass, Bar [9:0], and Stall [9:0] signals. Each lane has a decoupling queue (Stall DQ), 3:1 multiplexers, a register file (RF), an ALU, and RAZOR registers; neighboring lanes are linked through IQ_out, RF_out, ALU_out, and ALU_wb. Stage labels and latencies: WB and ALU/RF Weave (1 cycle); Instruction Fetch (1 cycle); IQ Weave and RF Access (1 cycle); Execute (2 cycles); Decoupling Queues (1 cycle). Example queue contents include lw (barrier), umul $r1, $r2, xor $r7, $r4, msub $r2, $r3, and add $r9, $r7.]
Decoupled Parallel SIMD Pipeline
STALL!
Decoupling synchronous ‘async’ between lanes!
Decoupled Parallel SIMD Pipeline (DPSP)
[Figure: fraction of peak throughput (0.5 to 1) versus probability of an error in a single stage, single lane (0 to 0.1), for SISD, 32-wide SIMD, and 32-wide DPSP.]
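The benefit of decoupling can be explored with a small Monte Carlo sketch. This is loosely inspired by the decoupling queues named on the slides and is not the Synctium design: the function name `issue_rate`, the queue depth, and the recovery model (an error costs that lane one replay cycle) are all assumptions. A queue depth of 1 reproduces lockstep behavior, since the sequencer cannot issue until every lane has finished.

```python
import random


def issue_rate(lanes, p_error, queue_depth, cycles=20000, seed=1):
    """Estimate instructions issued per cycle for lanes fed by per-lane queues."""
    rng = random.Random(seed)
    queued = [0] * lanes          # instructions waiting in each lane's queue
    replaying = [False] * lanes   # lane spends this cycle recovering from an error
    issued = 0
    for _ in range(cycles):
        # The shared sequencer issues to every lane only if no queue is full.
        if all(q < queue_depth for q in queued):
            queued = [q + 1 for q in queued]
            issued += 1
        for lane in range(lanes):
            if replaying[lane]:
                replaying[lane] = False      # the replay completes the instruction
                queued[lane] -= 1
            elif queued[lane] > 0:
                if rng.random() < p_error:
                    replaying[lane] = True   # timing error: redo it next cycle
                else:
                    queued[lane] -= 1
    return issued / cycles
```

Comparing `issue_rate(32, 0.05, queue_depth=1)` with `issue_rate(32, 0.05, queue_depth=8)` shows the decoupled configuration sustaining a higher issue rate, because a lane that hits an error can fall behind briefly without stalling its neighbors.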
Delay Characterization
[Figure: delay (ns), 2 to 12 ns, versus VDD (V) at 0.53, 0.6, 0.7, 0.8, and 1.0 V. Inset "Variation vs. Sample Size (Lane 1, 0.53V)": delay (ns), 4 to 10 ns, for 10-point and 100-point samples.]
Lane Weaving - Eliminating lanes improves total throughput
[Figure: maximum operating frequency (MHz), 50 to 500, for lanes 0 to 9 at VDD = 0.53V, 0.6V, 0.7V, 0.8V, and 1.0V. Frequency spread labels: 35%, 6%, 2%, 9%, and 8% Frequency Δ.]
[Figure: throughput (GOPS), 0 to 5, versus VDD (V) at 0.53, 0.60, 0.70, 0.80, and 1.0 for three configurations: 9 lanes with 1 spare, 10 lanes with 0 spares, and 8 lanes with 2 spares. Bar labels by group: 4.35, 3.56, 3.96; 3.25, 2.76, 2.99; 1.55, 1.36, 1.46; 0.85, 0.92, 0.86; 2.53, 2.18, 2.32.]
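The idea that dropping a slow lane can raise total throughput can be sketched as a small search. This assumes a model the slides do not state explicitly: all active lanes share one clock set by the slowest active lane, and each lane completes one operation per cycle. The function name `best_lane_configuration` and the lane frequencies in the usage note are hypothetical.

```python
def best_lane_configuration(lane_fmax_mhz, max_spares=2):
    """Pick how many of the slowest lanes to disable to maximize throughput.

    Returns (spares, throughput_mops), assuming the shared clock runs at the
    maximum frequency of the slowest lane that remains active.
    """
    ranked = sorted(lane_fmax_mhz, reverse=True)
    best = None
    for spares in range(max_spares + 1):
        active = ranked[:len(ranked) - spares]
        throughput = len(active) * min(active)
        if best is None or throughput > best[1]:
            best = (spares, throughput)
    return best
```

With made-up frequencies of `[100, 98, 97, 96, 95, 95, 94, 93, 92, 65]` MHz, running all ten lanes gives 650 MOPS, disabling the slowest lane gives 828 MOPS, and disabling two gives 744 MOPS, so one spare is the best choice in this example.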
[Figure: peak performance [OP/sec], 10^3 to 10^12, versus power [W], 10^-9 to 10^2. Labeled designs: Imagine, SPI Storm 1, Subliminal, Phoenix, SODA (90nm), AnySP, TILE64, Intel i5-67, ARM Cortex A9, ATI HD 5970, Scale VT, Intel SIMD accelerator, and TI LP DSP, with operating points at VDD = 1V, 0.6V, and 0.53V. Regions: Sub-Threshold Low-Performance; Near-Threshold Medium-Performance; Super-Threshold High-Performance.]
1st Near-VT Parallel-SIMD Processor with Variation Resiliency