Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
SPORTLabSPORTLab
Massoud PedramUniversity of Southern CaliforniaDept. of Electrical Engineering
Presentation at ISPDMarch 28, 2011
Robust Design of Power-Efficient VLSI Circuits
M. Pedram
Outline
• Background and Motivation• Key Problem: Robust & Energy-Efficient Design• Essential Elements of the Solution Approach• Changing Landscape and New Opportunities• Conclusion
2
M. Pedram
More Functionality and Higher Performance
3
Samsung’s Galaxy 4G Android Smartphone
Intel’s many-core Knights platform
Apple’s iphone 4 and ipad 2
Nvidia’s GeForceGTX 590 graphics processor
IBM supercomputer Watson wins Jeopardy!
Marvell’s ARMADA 1000 High-Def. Media Processor
ARM’s Cortex-A series of applications processors
M. Pedram
Power Efficiency
4
2010 ITRS’s Consumer Stationary Power Consumption
M. Pedram
Energy Cost
5
0
20
40
60
80
100
120
140
160
2000 01 02 03 04 05 06 07 08 09 10 11 12 13 14
Pow
er (B
KW
h)
Current TrendsImproved OperationsGreen Data Center Goal
0
20
40
60
80
100
120
140
160
2000 01 02 03 04 05 06 07 08 09 10 11 12 13 14
Pow
er (B
KW
h)
Current TrendsImproved Operations
Source: US EPA 2007
M. Pedram
Variations
6
Line-edge roughness
Random dopant fluctuation
Gate oxide thickness fluctuation
Normalized metal resistance data over 3 months
Normalized capacitance distribution
Sources: Kevin Nowka and Chandu Visweswariah
Back-end variability
M. Pedram
Variations
7
Example workload variations
Power supply droop map
Sources: Sani Nassif, J. Fredrich, L.A. Barroso Die thermal maps
CPU utilization of > 5,000 servers during a six-month period
M. Pedram
Overarching Goal
8
Minimize power consumption of digital VLSI circuits, in presence of PVT, workload, and SLA variations
M. Pedram
Double arrows indicate the desired scaling directionThe design space bounded by the three curves is diminishingRegion of uncertainty for designs is increasing
Shrinking Design Space and Increasing Uncertainty
Increase Performance
Reduce Switching Power
Reduce Leakage Power
Pleakage=const
Pswitching=const
fclk=const
Supply Voltage
Thr
esho
ld V
olta
ge
Region of Uncertainty Design space
9
M. Pedram
Required Components of a Global Solution
• Better characterizations, models, and calculators
• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ sensing
10
M. Pedram 11
A Current Source Model
• The non-linear behavior of the logic cell:– 2-D lookup table to store I(Vi,Vo)
• Parasitic effect of the logic cell:– 2-D lookup tables to store Ci(Vi,Vo), CM(Vi,Vo) and Co(Vi,Vo)
• Series of Spice simulation to pre-characterize the components of CSM model
Co(Vi,Vo)I(Vi,Vo)
Vo
io Vi ii CM(Vi,Vo)
Ci(Vi,Vo)
( , ) 0o o o iL o i o M M
V V V VC C I V V C Ct t t t
Δ Δ Δ Δ+ + + − =Δ Δ Δ Δ
M. Pedram
Transition to “Physical” Gate Modeling:Controlled Current Source Models
SlewoSlewi
DelayoDelayo ?
Slewo?Slewi ?
Zload(s)
Drivermodel Load (pincap)
model12
M. Pedram
Data Explosion Problem
CLoad
InSlew
• Conventional logic cell delay calculation techniques ignore the actual shape of waveform
• Current Source Modeling (e.g., ECSM)– Two-dimensional table of voltage waveforms in terms of input slew and
output capacitance • Size of the CSM library is a serious concern
– Data volume orders of magnitude greater than a .Lib library– Multiple Libraries in the Process, Temperature, Voltage (PVT) space – Additionally the CSM library may contain power, noise, and variability
• Goals– Reduce library size while
maintaining accuracy– Parameterize all waveform
data vs. slew, cap, and PVT
13
M. Pedram
Variational Waveform Modeling
• Sources of variations as input parameter:– Such as supply voltage, temperature, Leff and Vth
• Pre-alignment operations– V-operators (shift and scale operations in the direction of the voltage
axis )
14
M. Pedram
Orthonormal Transformation
• Each normalized waveform between 0 and 1 is represented by coefficients: α0, α1, α2, …, αn
W=α0+ α1P1+ …+ αnPn
Bases for the new representational space
Construct orthonormal basis using Principal Component Analysis
Training set waveforms
P1 P2 P3 P4 P5
…
15
M. Pedram
• In practice each normalized waveform between 0 and 1 is represented by using fixed number e.g., 4 coefficients: a0,…,a3– Time vector extraction from the cell library– Pre-alignment – Preprocessing including shifting, scaling, averaging and weighting,– Basis set extraction by using (Robust) Principal Component Analysis
– (R)PCA– Coefficient calculation for m “most significant” basis vectors
a0, a1, a2, a3
CLoad
InSlew
CLoad
InSlew
Library Data Compression
16
M. Pedram
Variational Waveform Modeling
• A 65nm ECSM library with 43141 waveforms• Nominal process corner, 1.2 volt, and 25oC• Each gate characterized for 7x7x5x5 (input slew, output capacitive load,
supply voltage and temperature) combinations• A voltage waveform modeled by 21 uniform point increments• Used the first five coefficients of PCA (76% compression)
0 0.005 0.01 0.015 0.02 0.0250
0.2
0.4
0.6
0.8
1
1.2
Time
Vol
tage
Variational@ T=15, Vdd=1.1Nominal @ T=25, Vdd=1.2
Variational and nominal waveform for an inverter
-0.4 -0.2 0 0.2 0.4 0.60
0.2
0.4
0.6
0.8
1
Time
Vol
tage
First BasisSecond BasisThird Basis
1st, 2nd, and 3rd basis vectors
0 0.02 0.04 0.06 0.08 0.1 0.120
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1Histogram of Error
error bin
% fo
r who
le li
brar
y
10%
Histogram of relative error for 43141 waveforms (5 bases)
17
M. Pedram
Required Components of a Global Solution
• Better characterizations, models, and calculators
• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ sensing
18
M. Pedram
Optimization flow: Multi-Corners + Multi-Modes
19
Lib Lib LibLib
Mobileconstraints
Desktopconstraints
Corner 1 Corner 2 Corner 3 Corner 4Un-optimizedDesign
Optimized for corner 1
Optimized for corner 2
Optimized for corner 3
Optimized for corner 4
Optimized Mobile design
Optimized Desktop design
M. Pedram
Multiobjective Optimization
where
fi : Rn→R = objective function
k (≥ 2) = number of (conflicting) objective functions
x = decision vector (of n decision variables xi)
S ⊂ Rn = feasible region formed by constraint functions and
``minimize´´ = minimize the objective functions simultaneously
S
Z
We consider multiobjective optimization problems:
20
M. Pedram
Concepts• A decision maker (DM) is needed to identify a final Pareto optimal
(PO) solution. (S)he has insight into the problem and can express preference relations
• An analyst is responsible for the mathematical side• Solution process = finding a solution• Final solution = feasible PO solution satisfying the DM• Ranges of the PO set: ideal objective vector z* (lower bounds of the
PO set) and approximated nadir objective vector znad (upper bounds of the PO set)
• Utopian objective vector, z**, is strictly better than z*
21
≤ ≤
∈
=
=
*
1
*
max ( )
argmin ( )
nadi i jj k
j jx S
f f x
x f x∈=* min ( )i ix S
f f x
znad
z*
M. Pedram
Concepts cont.• Value function U:Rk→R may represent preferences; other times DM
is expected to be maximizing some value (or utility)• We use the notation u:Rn→R where u(x)=U(f(x))
• If U(z1) > U(z2), then the DM prefers z1 to z2. If U(z1) = U(z2) then z1 and z2 are equally good (indifferent)
• Decision making can be thought of being based on either value maximization or satisficing
• An objective vector containing the aspiration levels ži of the DM is called a reference point, ž ∈Rk
• Problems are usually solved by scalarization, where a real-valued objective function is formed (depending on parameters). Then, single objective optimizers can be used!
22
M. Pedram
A Posteriori Methods: Weighting and ε-Constraint Methods
23
• Generate the PO set (or a part of it); Present it to the DM; Let the DM select one
• Problem
• Problem
M. Pedram
• The DM must specify an aspiration level ži for each objective function
• fi and aspiration level = a goal. Deviations from aspiration levels are minimized (fi(x) – δi = ži where δi may be positive or negative)
• The deviations can be represented as overachievements δi > 0 if fi(x) ≤ ži
• Weighted approach:
• With x and δi (i=1,…,k) as variables
Weights from the DM
A Priori Methods: Goal Programming
24
M. Pedram
• Idea: To classify the objective functions:– Functions to be improved (I<) – Functions whose values can be relaxed (I>)– Acceptable functions (I=)
–Notice that I< U I> U I= = {1,...,k}
• Assumptions– Trade-off information is available in the KT-multipliers
• Aspiration levels for functions in I< given by the DM, upper bounds for function in I> from the KT-multipliers
• Satisficing decision making is emphasized
Interactive Methods: SatisficingTrade-Off Method
25
M. Pedram
Satisficing Trade-Off Method cont.
26
minimizesubj. to x S∈
minimizesubj. to x S∈
• Problem
M. Pedram
Some Results – Transistor Sizing
27
4.03
3.52
3.25
3.95
3.48
3.26
3.88
3.5
3.31
3.87
3.56
3.35
3.23.253.3
3.353.4
3.453.5
3.553.6
3.653.7
3.753.8
3.853.9
3.954
4.05
1.8 2.1 2.4
Dela
y in
ns
Voltage Corners
optimization in 2.4V optimization in 2.1V
multi-objective optimization optimization in 1.8V
M. Pedram
Simulation results for the adder
28
M. Pedram
Required Components of a Global Solution
• Better characterizations, models, and calculators
• Multi-corner or statistical optimization• Augment design for runtime
adaptability• Dynamic control based on in-situ sensing
29
M. Pedram
• A voltage regulator module (VRM) converts and regulates a DC power source– A VRM can typically produce one of a number of distinct voltage
levels based on user-specified voltage identification (VID) code– A VRM must meet various voltage tolerances such as voltage
droops, output ripple and noise, dynamic load limit, etc.• Depending on these tolerances, the response time of the VRM can change
An example of voltage droops
On-Chip Power Delivery Network
M. Pedram
• A PDN configuration is defined by a VRM tree, which is a rooted tree with – A root node representing the main power source (battery)– Internal nodes denoting various VRM’s– Leaf nodes representing functional blocks (FB’s) with their supply voltage
and peak current levelsVRM tree
Voltage
P
VRM3
VRM4
VRM5
VRM6
VRM1
VRM2
[1.5 3.3]
[2.5]
[3.3 5.0]
[3.3]
Current
< 100mA
< 1A
< 500mA
< 1.2A
VRMs-to-FBs mapping
FB1
FB2
FB3
FB4
VRM tree with VRM-to-FB mapping
Voltage Regulator Module Tree
M. Pedram
Energy-Delay Optimal Pipeline Design
• Key idea: Allow the data to pass through a flip flop during some transparency window, instead of on the triggering edge of a clock
• Benefit : Enable opportunistic time borrowing across adjacent pipeline stages– The goal is to provide the timing-critical stages with more time to complete their
computations• Thus, reducing the probability of timing errors
• Determine values of: – operating frequency, supply voltage level, transparency window sizes of
the individual soft-edge FF-sets• Problem Formulations:
– stage delays are modeled as random variables due to process variations– allow timing violations to take place but then implement a mechanism to
detect and fix it
32
D Q D Q D QC1 C2FF0 FF1 FF2
d2d1
borrowed
M. Pedram
Soft Edge Flip Flops
D Q
CLK
SEFF
Clk
ClkD
Tran
spar
ency
Win
dow
DATAClk
D Q
Clk ClkD
Clk Clk
Clk
Clk
ClkD ClkD
ClkD
ClkD
ClkDDelay
-30
-15
0
0
50
100
150
200
40 60 80 100
Setu
p Ti
me
[ps]
Hold
-C2
Q -D
2Q [p
s]
Transparency Window [ps]
DtoQ
CtoQ
Hold
Setup
-30
-20
-10
0
10
20
30
50 70 90 110
Setu
p [p
s]
Transparency Window [ps]
0.80.850.90.951
, ,, , , ,
0
100
200
300
400
500
600
700
800
900
0 50 100 150 200 250
Pow
er C
onsu
mpt
ion
(uW
)
Transparency Window (ps)
Vdd=1.0
Vdd=0.9∙1
33
M. Pedram
Power-Delay Optimal Soft Pipeline
Design of power-delay optimal soft pipeline (OSP)• minimize total power-delay
(energy) of an N-stage pipeline, by finding optimal values of:– global supply voltage– pipeline clock period– transparency windows of
SEFF sets• Solution:
– enumerate all possible values for v,
– for each v, optimally solve a quadratic program, OSP-FV:
34
M. Pedram
SEFF with Built-in Error Detection
• A shadow latch resamples the input data with some phase delay– Need a phase-shifted global clock
signal, PS-Clk• The two sampled data are
compared to one another to detect and flag errors– Error can be handled at the higher
level (e.g., pipeline stall)
PS-Clk
D Q
Main Clk
C1SEFF
Shad
owLa
tch
Error
Main Clock
Phase ShiftedClock
Data is ready
Re-sampling Data
Sampling Data in regular SEFF
D Q
Clk Clk
~Clk
~Clk
~ClkD ~ClkD
ClkD
~PS-Clk
PS-ClkError
Error
ClkD
35
M. Pedram
Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline
36
M. Pedram
Multi-Threshold CMOS• High-Vth power switches are
connected to low-Vth logic gates– Achieves high performance due to
low-Vth logic gates– Reduces leakage power
dramatically due to the series-connected high-Vth power switch
• Typically only a header or a footer sleep transistor is used, not both
• A single sleep transistor may be shared among several logic gates
Gate1
VVSS
Gate2 Gate3
SLEEP
SLEEP
High-Vth
SLEEPVirtual VDD
Virtual VSS
Low-Vth logicgates
37
M. Pedram
Tri-Modal MTCMOS Switch
SLEEP
Sleep Inverter
VVSS
MS1
MS2
MS
Circuit Block
VDD VDD
DROWSYMD1
MD2
GS
SLEEP/DROWSY Multi-Mode Switch
Function 0X Active 10 Sleep 11 Drowsy
38
M. Pedram
Multimodal Power-Gated Pipeline Architecture
• Use different TM (Tri-Modal) switches for pipeline registers and for combinational logic gates so as to enable different power-gating modes for these circuit elements
TM Switch 1
ds
TM Switch 2
ds
VSSVSSVSS
VSSVSSDatainDataout
SleepDrowsy1
SleepDrowsy2
39
M. Pedram
• We designed and implemented a 16×16 pipelined carry save multiplier (CSM) using TSMC 0.18um CMOS– The circuit is divided into two pipeline stages– The 46-bit output of the first stage is latched into the pipeline
registers (46 FF’s)– The first 16 bits out of these 46 bits make the first 16 bits of the
product and are passed to the output directly– The last 30 bits are passed to the second stage making the last
16 bits of the product
HDL (Verilog-XL)
Synthesis (Design Compiler)
Timing Analysis
Place and Route(Encounter)
Netlist Extractionand
HSPICE Simulations40
Experimental Results
M. Pedram
Results, cont’d
41
• Four circuits, which are in the standby modes, are compared :
Circuit Type Leakage (nA)
Ground-Bounce
(mV)
Wakeup/ReadyLatency (ns)
CMOS 63 - -Deep-Sleep 0.10 473 19.32Drowsy 48 143 4.83Data-Retentive 2.85 441 19.32
Circuit TypeStage delay (ns)
Cell area(um2)
Wire length (um)
Wire length (um)
ni=1 ni=2
CMOS 4.54 54720 54402.6 54402.6MTCMOS 4.83 55710 59008.4 56077.2% Increase 6.4 1.8 8.5 3.1
– CMOS– Deep-sleep MTCMOS:
all the cells (including FF’s and logic gates) are in the sleep mode
– Drowsy MTCMOS: all the cells (including FF’s and gates) are in the drowsy mode
– Data-retentive MTCMOS: Logic cells are in sleep mode and pipeline FF’s are in drowsy mode
M. Pedram
Required Components of a Global Solution
• Better characterizations, models, and calculators
• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ
sensing
42
M. Pedram
Architectures That Tolerate Uncertainty
• Due to variations and computational constraints, there will be significant uncertainty about the real state of a circuit as a result of an optimization decision
• Need solution techniques that can learn from their mistakes and/or successes on similar problem instances encountered in the past to improve the quality of their decision making
• Utilize partially-observable (semi-)Markov decision process model
State estim ator
πa
obState
estim atorπ
ao
b
Observation Belief stateAction/command
Optimization policy
State estimator
43
M. Pedram
Supervisory Mode and Dynamic Control
• To avoid over constraining a system in terms of its power dissipation or temperature, one must adopt online control and supervision solutions that change the system behavior on the fly so as to maximize performance without violating the constraints (or provide the required performance while minimizing energy consumption).
44
M. Pedram
Idleness, State Transitions, and Power Savings
45Source: W-D. Weber
Time scale
ns ms sec min hour day
Components sleeping Machines sleeping
Better active states
More sleepstates
Thread packing Job packing(to cores) (to machines)
M. Pedram
Reality: Non-Energy-Proportional Servers
46M. Pedram, USC
Power dissipation of Intel Xeon 5400 series server at high (2.3GHz) and low (2GHz) settings
12
17
22
27
32
37
42
0 100 200 300 400
pow
er(W
)
total util(%)
Here P1 denotes server power dissipation at 100% utilization, whereas PU is power at utilization level of UAn energy proportional system will have an Energy Inefficiency Factor (EIF) of one at all utilization levels
M. Pedram
GQ
Task
A
ssig
nmen
t
PMU
Core
Task
LQ
Minimum Energy CMP Design with Core Consolidation and DVFS
• Chip Multiprocessor system with M identical cores– Per-core DVFS with N (voltage-frequency) configurations– Local Queue
• Power Management Unit• Global Queue• Task Dispatcher• Tasks
– Expected execution time, τ – Expected instructions per cycle
Objective: Minimize the total Energy consumption of cores in a multicore system
• Constraints: Minimum required throughput, IPSreq• Variables:
– Number of active cores –total number of cores: M– v-f settings of cores –total settings: J– Task distribution– total number of tasks: K
47
M. Pedram
A Hierarchical Solution• Determine the number of ON cores
– ON cores are in C0 when active or C2 (halt or sleep) when idle
• Determine operating frequency of ON cores– A feedback-based control method is adopted for DVFS setting
• This is needed due to inherent uncertainty and variability of task characteristics
• PI Controller: controller adjusts the v-f setting to match the required throughput based on the observed error
• Feedback control loop determines a single optimum frequency for all cores and then the Quantizer would translate fopt to the available DVFS levels
• Find a feasible assignment of tasks to the ON cores
m
Energy
mopt
48
M. Pedram
Simulation Results• Comparison to a relatively energy-efficient baseline PM
– It implements the same power reduction techniques as our method– It utilizes open loop DVFS, and does not support smart wakeup and
shutdown• The figure compares the frequency setting and throughput
– The baseline PM always runs with all cores ON– The closed loop frequency is always lower for the same throughput– In the second 100ms, the baseline chooses lower frequency with four
cores running while our proposed method uses only three cores• The proposed PM gains an overall energy consumption of 17%
0
5000
10000
15000
20000
25000
0
500
1000
1500
2000
2500
1 611 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 101
106
111
116
121
126
131
136
141
146
151
156
161
166
171
176
181
186
191
196
201
206
211
216
221
226
231
236
241
246
251
256
261
266
271
276
281
286
291
296
Thro
ugh
pu
t [M
IPS]
Freq
uen
cy [M
Hz]
Time [ms]
f_proposed
f_base
H_target
H_proposed
H_basem=3
m=4
m=4m=3
m=4
m=4
49
M. Pedram
Changing Landscape: Smart Grid and Dynamic Pricing
50
Source: M. Martonosi
M. Pedram
Electrical Energy Storage Systems
51
M. Pedram
Conclusion
• These are exciting times with many new opportunities and challenges due to planned upgrades to the Power Grid, introduction of renewable sources of energy, smart metering and dynamic energy pricing, people’s awareness of environmental issues, etc.
• A holistic , cross-layer approach to energy efficiency and robustness is needed, which spans– Application efficiency and energy management, micro-architecture and
system design, storage and networks, resource management and scheduling
– Synthesis and physical design– Library characterization and cell design
52