sakthi-velan
8/13/2019 LOW POWER DESIGN METHODOLOGIES
ELEN 468 Lecture 29 1
ELEN 468 Advanced Logic Design
Lecture 29
Low Power Design
Power Dissipation
[Figure: power (Watts, log scale from 0.1 to 100) versus year (1971 to 2000) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6]
Power increases despite Vdd decrease
Courtesy, Intel
Power Density
[Figure: power density (W/cm2, log scale from 1 to 10000) versus year (1970 to 2010) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference levels marked Hot Plate, Nuclear Reactor, and Rocket Nozzle]
Courtesy, Intel
Why Power Increased
Growing die size, fast frequency scaling
[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year, 1985 to 2005]
Gate Power Dissipation
Leakage power
Dynamic power
Short circuit power
Dynamic Power
Occurs at each switching
$P_d = C_L V_{dd}^2 f_p$, where $f_p$ is the switching frequency
[Figure: two circuit diagrams with nodes Vdd and out; regions labeled Saturation and Linear]
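As a rough numerical illustration of the dynamic power expression (load capacitance times supply voltage squared times switching frequency), the sketch below compares two supply voltages. The capacitance, voltage, and frequency values are illustrative assumptions, not figures from the lecture.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power in watts: C_L * Vdd^2 * f_p."""
    return c_load * vdd ** 2 * f_switch

# Assumed example values: 10 fF load switching at 400 MHz.
p_nominal = dynamic_power(c_load=10e-15, vdd=1.5, f_switch=400e6)
p_low_vdd = dynamic_power(c_load=10e-15, vdd=1.2, f_switch=400e6)

# Power scales with the square of the supply voltage.
ratio = p_low_vdd / p_nominal
print(f"nominal: {p_nominal:.2e} W, low Vdd: {p_low_vdd:.2e} W, ratio: {ratio:.2f}")
```

The quadratic dependence on the supply voltage is why lowering Vdd is the most effective single lever on dynamic power.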
Leakage Power
Static
Leakage current = a Vdd
Leakage current = b/Vt
Killer to CMOS technology
[Figure: two circuit diagrams with nodes Vdd and out; regions labeled Saturation and Linear, with leakage paths marked]
Short Circuit Power
During switching, there is a short moment when both PMOS and NMOS are partially on
$P_s = Q (V_{dd} - V_t)^3 t_r f_p$, where $t_r$ is the rising time
[Figure: circuit with nodes Vdd and out, shown for a rising input and a falling input]
Where Does Power Go?
[Figure: power percentages, on a 0% to 100% scale, split into core transistor leakage, gate leakage, cache leakage, and active power. Source: Scalable X86 CPU Design for 90nm]
Low VT devices are
Energy-Performance Space
Every design is a point on a 2-D plane
[Figure: 2-D plane with axes Performance and Energy]
Low Power Design
Reduce dynamic power
α (activity factor): clock gating, sleep mode
C: small transistors (esp. on clock), short wires
VDD: lowest suitable voltage
f: lowest suitable frequency
Reduce static power
Selectively use low Vt devices
Power gating, MTCMOS
Stacked devices
Body bias
Clock Gating
Gate off clock to idle functional units, e.g., floating point units
need logic to generate disable signal
  increases complexity of control logic
  consumes power
  timing critical to avoid clock glitches at OR gate output
additional gate delay on clock signal
  gating OR gate can replace a buffer in the clock distribution tree
[Figure: clock and disable feed an OR gate whose output clocks the register (Reg) in front of the functional unit]
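The OR-gate gating described above can be sketched behaviorally. This is a minimal sampled model, assuming the disable signal only changes at sample boundaries, so it does not capture the glitch-timing issue the slide warns about; the waveforms are made up for illustration.

```python
def gated_clock(clock, disable):
    """OR-gate clock gating: the output is held high while disable is 1."""
    return [c | d for c, d in zip(clock, disable)]

def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that clock the register."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)

clock = [0, 1] * 8                       # 8 clock cycles, sampled twice per cycle
disable = [0] * 6 + [1] * 6 + [0] * 4    # unit idle for three cycles in the middle
gated = gated_clock(clock, disable)

print("free-running edges:", rising_edges(clock))
print("gated edges:", rising_edges(gated))
```

Each suppressed rising edge is a cycle in which the register and the functional unit behind it do not switch.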
Active Power Reduction - Supply Voltage Reduction
Static
Pros:
  Always active in saving
Cons:
  Additional power delivery network
  Needs special care of interface between power domains: signals close to Vt cause excessive leakage and reduced noise margins
Dynamic
Adjusting operation voltage and frequency to performance requirements:
  High performance: high Vdd and frequency
  Power saving: low Vdd and frequency
Pros:
  Doesn't limit performance
Cons:
  Penalty of transition between different power states can be high (in performance and power)
  Additional control logic
[Figure: circuit blocks labeled Slow, Fast, and Slow, with High Supply Voltage and Low Supply Voltage rails]
Voltage Islands (Multi-Vdd)
Allow both macro and cell voltage assignment
Allow different voltage islands in the same circuit row
Lift unnatural layout restrictions
Minimal placement disturbance
[Figure: layouts with Vddh and Vddl regions, labeled Usami+ JSSC98, Lackey+ ICCAD02, and GVI DAC03]
Level Converter
Interface circuit when Vddl drives Vddh, to avoid leakage
[Figure: a VddL gate driving a VddH gate, with one transistor marked "weak on!"]
[Figure: conventional dual supply level converter (Vddh, Vddl, IN, OUT)]
[Figure: new single supply level converter (Vddh, IN, OUT)]
Adjacency Metrics for Clustering
Logic adjacency metric (LAM): Vddl fanin cone of level shifter without going through Vddh
[Figure: two views of Vddh and Vddl regions with level converters; the first shows LC1, LC2, and LC3, the second shows LC2 and LC3]
Physical adjacency metric (PAM): for each candidate Vddl cell, compute total size of its neighbor Vddl cells
LAM to guide logic aware voltage assignment
PAM to guide placement aware voltage re-assignment
Level Converter Optimizations
Logic replacement (or gate sizing)
[Figure: circuits built from ZMUX1, ZMUX2, and DEC blocks with inputs A and B and level converters (LC)]
LC/Buffer co-optimization
Placement to Form Voltage Islands with Power Grid Co-design
Based on Vddl and Vddh cell placement after voltage assignment, define Vddl/Vddh power grids on demand
Detailed placement to form Vddl/Vddh voltage islands that can hit their corresponding power supplies
[Figure: power grids on demand, with alternating Vddl and Vddh rails]
Example of Voltage Islands
Vddl = 1.2 V, Vddh = 1.5 V
No timing degradation, no area increase!
IBM Cu11, 0.13 um, 400 MHz (courtesy IBM)
Dynamic Frequency and Voltage Scaling
Always run at the lowest supply voltage that meets the timing constraints
DFS (dynamic frequency scaling) saves only power
DVS (dynamic voltage scaling) + DFS saves both energy and power
A DVS+DFS system requires the following:
A programmable clock generator (PLL)
  PLL from 200 MHz to 700 MHz in increments of 33 MHz
A supply regulation loop that sets the minimum VDD necessary for operation at the desired frequency
  32 levels of VDD from 1.1 V to 1.6 V
An operating system that sets the required frequency and supply voltage to meet the task completion deadlines
  heavier load: ramp up VDD, when stable speed up clock
  lighter load: slow down clock, when PLL locks onto new rate, ramp down VDD
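The claim that DFS saves only power while DVS plus DFS saves both energy and power can be checked with a small model. The sketch assumes purely dynamic power (effective capacitance times VDD squared times frequency) and a task with a fixed cycle count; the capacitance, the cycle count, and the pairing of 1.1 V with 350 MHz are illustrative assumptions.

```python
def power_and_energy(c_eff, vdd, freq, cycles):
    """Return (power in W, energy in J) for a task of a fixed cycle count."""
    power = c_eff * vdd ** 2 * freq   # dynamic power only
    runtime = cycles / freq           # the task takes longer at lower frequency
    return power, power * runtime

C_EFF = 1e-9     # assumed effective switched capacitance (F)
CYCLES = 7e8     # assumed task length in clock cycles

p_full, e_full = power_and_energy(C_EFF, 1.6, 700e6, CYCLES)
p_dfs, e_dfs = power_and_energy(C_EFF, 1.6, 350e6, CYCLES)    # frequency halved only
p_dvfs, e_dvfs = power_and_energy(C_EFF, 1.1, 350e6, CYCLES)  # voltage lowered as well

print(f"full speed: {p_full:.3f} W, {e_full:.3f} J")
print(f"DFS only:   {p_dfs:.3f} W, {e_dfs:.3f} J")
print(f"DVS + DFS:  {p_dvfs:.3f} W, {e_dvfs:.3f} J")
```

Under this model, halving the frequency halves the power but doubles the runtime, so the energy per task is unchanged; only the lower supply voltage reduces the energy.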
Leakage Reduction Techniques
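[Figure: four leakage reduction techniques]
stack effect: pullup (Vdd) above stacked devices of widths Wu and Wl, with intermediate node Vx
dual Vt partitioning: high Vt devices and low Vt devices
variable threshold (VTCMOS): body biases Vnwell and Vpwell set relative to Vdd and 0
multi-threshold (MTCMOS): low Vt logic between virtual Vdd and virtual Gnd, connected to Vdd and ground through HVT sleep transistors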
Natural Transistor Stacks
Reduce the leakage by stacking the devices
Reduced Vds
Negative Vgs
Negative Vbs
How?
Design with Dual Vth
Dual Vth design
Two flavors of transistors: slow (high Vth) and fast (low Vth)
Low Vth devices are faster, but have 10X leakage
Dual Vth evaluation
Impacts of Variable VT
Reducing the VT increases the sub-threshold leakage current (exponentially)
$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$
where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$, $V_{SB}$ is the source-bulk (substrate) voltage, and $\gamma$ is the body-effect coefficient
But, reducing VT decreases gate delay (increases performance)
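To see how the source-bulk voltage moves the threshold, the sketch below evaluates the standard body-effect expression, VT = VT0 + γ(√|−2φF + VSB| − √|−2φF|). The parameter values (VT0, γ, φF) are illustrative assumptions for an NMOS device, not values given in the lecture.

```python
import math

def threshold_voltage(vt0, gamma, phi_f, vsb):
    """Body-effect model: VT0 + gamma * (sqrt(|-2*phi_F + VSB|) - sqrt(|-2*phi_F|))."""
    return vt0 + gamma * (
        math.sqrt(abs(-2.0 * phi_f + vsb)) - math.sqrt(abs(-2.0 * phi_f))
    )

# Assumed NMOS parameters: VT0 = 0.43 V, gamma = 0.4 V^0.5, phi_F = -0.3 V.
vt_zero_bias = threshold_voltage(0.43, 0.4, -0.3, vsb=0.0)
vt_reverse_bias = threshold_voltage(0.43, 0.4, -0.3, vsb=1.0)

print(f"VT at VSB = 0 V: {vt_zero_bias:.3f} V")
print(f"VT at VSB = 1 V: {vt_reverse_bias:.3f} V")
```

A positive VSB (reverse body bias) raises VT, which is the mechanism that variable-threshold schemes use to trade speed for leakage.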
Forward/Reverse Body Biasing
RBB (Reverse Body Bias): zero body bias in active mode, a deep reverse bias in standby mode.
Disadvantages:
  Increased PN junction reverse leakage
  Scaling down technology worsens short channel effects and weakens the Vth modulation capability
FBB (Forward Body Bias): high Vth in standby mode, forward body biasing to achieve better current drive in active mode.
Disadvantages:
  Larger junction capacitance
  High body effect for stacked devices
Implementation of Dynamic Vth Scaling (DTS)
The lowest Vth is delivered (NBB, no body bias) if the highest performance is required.
When the performance demand is low, clock frequency is lowered and Vth is raised via RBB to reduce the run time leakage power dissipation.
How?
When the critical path replica frequency is less than the reference CLK, adjust bias to decrease Vth.
Otherwise adjust bias to increase Vth.
[Figure: results]
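The feedback rule (lower Vth when the critical path replica is slower than the reference clock, raise it otherwise) can be sketched as a simple bang-bang loop. The linear replica model, the step size, and the Vth limits below are hypothetical placeholders chosen only to make the loop runnable.

```python
def replica_frequency(vth):
    """Hypothetical replica model: frequency (Hz) falls linearly as Vth (V) rises."""
    return 1000e6 * (1.0 - vth)

def dts_step(vth, f_ref, step=0.01, vth_min=0.2, vth_max=0.5):
    """One control step of the rule described above."""
    if replica_frequency(vth) < f_ref:
        vth -= step   # replica too slow: decrease Vth (less reverse body bias)
    else:
        vth += step   # replica fast enough: increase Vth to cut leakage
    return min(max(vth, vth_min), vth_max)

vth = 0.2             # start at the lowest Vth (no body bias)
for _ in range(60):
    vth = dts_step(vth, f_ref=600e6)

print(f"settled Vth: {vth:.2f} V")
```

With these assumed numbers the loop settles near the Vth at which the replica just matches the reference clock and then dithers by one step around it.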
Power Gating Using Sleep Transistors
Or can reduce leakage by gating the supply rails when the circuit is in sleep mode
  in normal mode, sleep = 0 and the sleep transistors must present as small a resistance as possible (via sizing)
  in sleep mode, sleep = 1, the transistor stack effect reduces leakage by orders of magnitude
Or can eliminate leakage by switching off the power supply (but lose the memory state)
Example of Power Gating
[Figure: layout with embedded power switches, rows of standard cells, and power switch control signals]
Can reduce power 1000X
Smaller voltage swing (IR drop on sleep transistors)
Lower performance
Increased noise coupling
Local power grid design
Power Dissipation on Variation Tolerance
Conventional variation tolerance
Using large timing safety margin
Implies aggressive timing target
Greater power dissipation
Observation
Near-worst-case variations occur rarely
Safety margin is applied continuously to guard against the small chance of variations
Poor power efficiency
Question..
Can we deal with errors instead of preventing them from occurring by conservative binning/clocking?
How fast can we speed up the circuit with the error rate in a manageable range?
Fault tolerant system
Begin with reference values
Introduce redundancy
  Hardware: Triple Modular Redundancy
  Time: Repeated process
  Information: Code
  Software: various algorithms
How about for delay fault?
  How do we detect (may be correct?) errors?
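As a concrete instance of hardware redundancy, Triple Modular Redundancy runs three copies of a unit and takes a bitwise majority vote of their outputs. The sketch below shows only the voter; the three output words are made-up example values.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant outputs."""
    return (a & b) | (b & c) | (a & c)

# Three copies of the same unit; the third one has a single faulty bit.
copy_1 = 0b1011
copy_2 = 0b1011
copy_3 = 0b0011

voted = majority_vote(copy_1, copy_2, copy_3)
print(f"voted output: {voted:04b}")
```

The voter masks any fault confined to one copy, at the cost of roughly tripling the hardware, which motivates the cheaper delay-fault-specific schemes on the following slides.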
Delay fault tolerant system
Delay fault detection
Redundant timing margin in signal path
  +: Second sampling at increased clock period
  -: Decrease delay of reference signal between pipeline registers
[Figure: timing diagram with edges t1 and t2, the timing margin, and the 2nd sampling point after a delay t]
Delay fault tolerant system
Delay fault removal
  Reference signal (SR)
  Reprocessing at slower clock period (t)
[Figure: timing diagram with edges t1 and t2, the timing margin, the delay t, and the reference signal SR]
Delay fault tolerant system: Example
RAZOR* Dynamic Voltage Scaling Design
Reduce power: scale voltage down to a manageable failure rate
[Figure: timing diagram with edges t1 and t2 and the timing margin]
* Razor: a low-power pipeline based on circuit-level timing speculation, D. Ernst et al, 36th Annual IEEE/ACM International Symposium on Microarchitecture 2003
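The slide does not spell out the detection mechanism. In the cited Razor paper, each critical flip-flop is paired with a shadow latch that samples the same data on a delayed clock, and a mismatch between the two flags a timing error. The sketch below is a simplified behavioral model of that comparison; the arrival times, clock period, and shadow delay are assumed values in arbitrary time units.

```python
def razor_capture(arrival, period, shadow_delay, old_value, new_value):
    """Return (value passed on, error flag) for one pipeline register bit."""
    # Main flip-flop samples at the clock edge; late data leaves the old value.
    main = new_value if arrival <= period else old_value
    # Shadow latch samples later, so it still catches moderately late data.
    shadow = new_value if arrival <= period + shadow_delay else old_value
    error = main != shadow
    # On an error the shadow value is the one restored into the pipeline.
    return (shadow if error else main), error

arrivals = [0.80, 0.95, 1.05, 0.90]   # assumed data arrival times per cycle
results = [
    razor_capture(t, period=1.0, shadow_delay=0.2, old_value=0, new_value=1)
    for t in arrivals
]
errors = [err for _, err in results]
print(errors)
```

Lowering the supply voltage stretches the arrival times; the scheme tolerates this as long as late data still arrives within the shadow delay and errors stay rare enough that recovery cost does not outweigh the power saved.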
Delay fault tolerant system: Example
RAZOR continued
Implemented to 120 MHz clock frequency
But for high speed circuits:
  Managing two clocks
  Minimum path delay constraint
  Delay of MUX
Delay fault tolerant system: Example
Parity coding
  Parity generation based on output correlation
  Avoid well-correlated outputs for pairing
[Figure: timing diagram showing the timing margin and delay t]
Now, let's look at delay distribution(s)
Clock speed achieved for contained error rate
Delay fault tolerant system: Example
Parity coding (continued)
  Complexity
  Example: C499 ISCAS Benchmark
Recently Proposed Design
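Fault detection
  Partial hardware and time redundancy
[Figure: pipeline stage between latches Ln and Ln+1 with gates g0 through gm, split into FL and BL, plus a duplicated BL' feeding latch L'n+1; the timing margin and delay t are marked]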
Proposed Design
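Fault removal
  Pipeline flush and reprocessing at lower clock
[Figure: pipeline stage between latches Ln and Ln+1 with gates g0 through gm, split into FL and BL, plus a duplicated BL' feeding latch L'n+1]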
Proposed Design
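Division of FL and BL
[Figure: block diagram from PI to PO through FL and BL, with a latch, a second BL block, and a comparison block CP producing the Error? signal]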
Proposed Design
Division of FL and BL
Considerations
  The effects on the original circuit should be minimal
  Maximize delay fault detection coverage
  Minimize added complexity
Proposed Design
Division of FL and BL
First, POs to BL
Gate with longest delay to gate with shortest delay
For the gates connected to BL, choose the gate with maximum delay
Then, any gate whose number of fanouts > number of fanins
Proposed Design
Delay fault detection coverage
  $d_{FL}$: delay from PI to any gate in FL
  $d_i$: delay from PI to any gate in original circuit
$C_F = 1 - \frac{\max\{d_{FL}\}}{\max\{d_i\}}$
Add graphical view
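A small numerical sketch of the coverage metric, assuming it is defined as one minus the ratio of the largest PI-to-gate delay inside FL to the largest PI-to-gate delay in the whole circuit. The delay values are made up for illustration.

```python
def delay_fault_coverage(fl_delays, all_delays):
    """C_F = 1 - max(d_FL) / max(d_i)."""
    return 1.0 - max(fl_delays) / max(all_delays)

# Assumed PI-to-gate delays in ns; the first three gates are placed in FL.
all_delays = [1.0, 2.0, 3.0, 4.0, 5.0]
fl_delays = all_delays[:3]

coverage = delay_fault_coverage(fl_delays, all_delays)
print(f"C_F = {coverage:.2f}")
```

Moving more of the slow gates into BL lowers the largest FL delay and so raises the coverage, at the price of more duplicated logic.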
Proposed Design
Delay simulation
  SPICE simulation
  TSMC 0.18 um tech., Vcc = 1.6 V
  Gate delay for rising and falling signal
  Load: inverter
  Different input combinations are considered
Delay simulation
  Randomly generated test vectors
  $10^6$ to $10^8$ according to number of primary inputs (PI)
Proposed Design
Area complexity
  Ngate: Number of gates in the original circuit
  Nff: Number of flip-flops in each pipeline, (NPI + NPO)/2
  Ngate_BL: Number of gates in BL
  Ngate_CP: Number of gates in comparison block
  NLatch: Number of latches = Number of connections between FL and BL
  w: Complexity ratio of flip-flop to gate
$C_A = \frac{N_{gate\_BL} + N_{gate\_CP} + N_{Latch}}{N_{gate} + w \cdot N_{ff}}$
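The area complexity ratio can be evaluated directly from the gate counts. The sketch assumes the added hardware (BL gates, comparison gates, latches) is summed in the numerator and the original gates plus weighted flip-flops form the denominator; all counts and the weight w below are hypothetical.

```python
def area_complexity(n_gate_bl, n_gate_cp, n_latch, n_gate, n_ff, w):
    """C_A = (N_gate_BL + N_gate_CP + N_Latch) / (N_gate + w * N_ff)."""
    return (n_gate_bl + n_gate_cp + n_latch) / (n_gate + w * n_ff)

# Hypothetical circuit: 200 gates, 40 primary inputs, 20 primary outputs.
n_pi, n_po = 40, 20
n_ff = (n_pi + n_po) / 2          # flip-flops per pipeline stage, as defined above

c_a = area_complexity(
    n_gate_bl=40, n_gate_cp=20, n_latch=10,
    n_gate=200, n_ff=n_ff, w=5,
)
print(f"C_A = {c_a:.2f}")
```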
Fault Coverage vs. Complexity
[Figure: four plots of Fault Detection Coverage vs. Added Complexity for ISCAS benchmarks C499, C432, C880, and C6288; axes are fault detection coverage CF and added complexity CA]
Complexity
Effective complexity penalty
Depends on application
More than half of area is cache
Speed critical part: integer unit
$C_{AE} = C_A \cdot \frac{\text{Applicable area}}{\text{Total chip area}} < 0.5\, C_A$
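As a worked example of the effective penalty, the sketch below scales the added complexity by the fraction of the chip to which the scheme applies. It assumes the effective complexity is the product of the added complexity and the ratio of applicable area to total chip area; the area figures are illustrative.

```python
def effective_complexity(c_a, applicable_area, total_area):
    """C_AE = C_A * (applicable area / total chip area)."""
    return c_a * applicable_area / total_area

# Assumed numbers: 20% added complexity on logic that covers 40% of the die.
c_ae = effective_complexity(c_a=0.2, applicable_area=40.0, total_area=100.0)
print(f"C_AE = {c_ae:.2f}")
```

Because cache takes more than half of the die and is not covered by the scheme, the chip-level penalty stays below half of the block-level figure.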
Estimation of Complexity
[Figure: Intel Pentium 4 Processor on 90 nm Process, with blocks labeled "& AGU", Data Cache, Align Mux, Registers, and ALUs]
Conclusion
Delay fault tolerant design is proposed
Possible operation clock frequency gain is estimated from modeling and experiments
Delay fault detection coverage and complexity are analyzed for optimal implementation
It shows that 10% clock frequency gain is possible with the proposed design at a moderate (8-25%) complexity increase