Robust Design of Power-Efficient VLSI Circuits

SPORTLabSPORTLab

Massoud PedramUniversity of Southern CaliforniaDept. of Electrical Engineering

Presentation at ISPDMarch 28, 2011

Robust Design of Power-Efficient VLSI Circuits

M. Pedram

Outline

• Background and Motivation• Key Problem: Robust & Energy-Efficient Design• Essential Elements of the Solution Approach• Changing Landscape and New Opportunities• Conclusion

2

M. Pedram

More Functionality and Higher Performance

3

Samsung’s Galaxy 4G Android Smartphone

Intel’s many-core Knights platform

Apple’s iphone 4 and ipad 2

Nvidia’s GeForceGTX 590 graphics processor

IBM supercomputer Watson wins Jeopardy!

Marvell’s ARMADA 1000 High-Def. Media Processor

ARM’s Cortex-A series of applications processors

M. Pedram

Power Efficiency

4

2010 ITRS’s Consumer Stationary Power Consumption

M. Pedram

Energy Cost

5

0

20

40

60

80

100

120

140

160

2000 01 02 03 04 05 06 07 08 09 10 11 12 13 14

Pow

er (B

KW

h)

Current TrendsImproved OperationsGreen Data Center Goal

0

20

40

60

80

100

120

140

160

2000 01 02 03 04 05 06 07 08 09 10 11 12 13 14

Pow

er (B

KW

h)

Current TrendsImproved Operations

Source: US EPA 2007

M. Pedram

Variations

6

Line-edge roughness

Random dopant fluctuation

Gate oxide thickness fluctuation

Normalized metal resistance data over 3 months

Normalized capacitance distribution

Sources: Kevin Nowka and Chandu Visweswariah

Back-end variability

M. Pedram

Variations

7

Example workload variations

Power supply droop map

Sources: Sani Nassif, J. Fredrich, L.A. Barroso Die thermal maps

CPU utilization of > 5,000 servers during a six-month period

M. Pedram

Overarching Goal

8

Minimize power consumption of digital VLSI circuits, in presence of PVT, workload, and SLA variations

M. Pedram

Double arrows indicate the desired scaling directionThe design space bounded by the three curves is diminishingRegion of uncertainty for designs is increasing

Shrinking Design Space and Increasing Uncertainty

Increase Performance

Reduce Switching Power

Reduce Leakage Power

Pleakage=const

Pswitching=const

fclk=const

Supply Voltage

Thr

esho

ld V

olta

ge

Region of Uncertainty Design space

9

M. Pedram

Required Components of a Global Solution

• Better characterizations, models, and calculators

• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ sensing

10

M. Pedram 11

A Current Source Model

• The non-linear behavior of the logic cell:– 2-D lookup table to store I(Vi,Vo)

• Parasitic effect of the logic cell:– 2-D lookup tables to store Ci(Vi,Vo), CM(Vi,Vo) and Co(Vi,Vo)

• Series of Spice simulation to pre-characterize the components of CSM model

Co(Vi,Vo)I(Vi,Vo)

Vo

io Vi ii CM(Vi,Vo)

Ci(Vi,Vo)

( , ) 0o o o iL o i o M M

V V V VC C I V V C Ct t t t

Δ Δ Δ Δ+ + + − =Δ Δ Δ Δ

M. Pedram

Transition to “Physical” Gate Modeling:Controlled Current Source Models

SlewoSlewi

DelayoDelayo ?

Slewo?Slewi ?

Zload(s)

Drivermodel Load (pincap)

model12

M. Pedram

Data Explosion Problem

CLoad

InSlew

• Conventional logic cell delay calculation techniques ignore the actual shape of waveform

• Current Source Modeling (e.g., ECSM)– Two-dimensional table of voltage waveforms in terms of input slew and

output capacitance • Size of the CSM library is a serious concern

– Data volume orders of magnitude greater than a .Lib library– Multiple Libraries in the Process, Temperature, Voltage (PVT) space – Additionally the CSM library may contain power, noise, and variability

• Goals– Reduce library size while

maintaining accuracy– Parameterize all waveform

data vs. slew, cap, and PVT

13

M. Pedram

Variational Waveform Modeling

• Sources of variations as input parameter:– Such as supply voltage, temperature, Leff and Vth

• Pre-alignment operations– V-operators (shift and scale operations in the direction of the voltage

axis )

14

M. Pedram

Orthonormal Transformation

• Each normalized waveform between 0 and 1 is represented by coefficients: α0, α1, α2, …, αn

W=α0+ α1P1+ …+ αnPn

Bases for the new representational space

Construct orthonormal basis using Principal Component Analysis

Training set waveforms

P1 P2 P3 P4 P5

…

15

M. Pedram

• In practice each normalized waveform between 0 and 1 is represented by using fixed number e.g., 4 coefficients: a0,…,a3– Time vector extraction from the cell library– Pre-alignment – Preprocessing including shifting, scaling, averaging and weighting,– Basis set extraction by using (Robust) Principal Component Analysis

– (R)PCA– Coefficient calculation for m “most significant” basis vectors

a0, a1, a2, a3

CLoad

InSlew

CLoad

InSlew

Library Data Compression

16

M. Pedram

Variational Waveform Modeling

• A 65nm ECSM library with 43141 waveforms• Nominal process corner, 1.2 volt, and 25oC• Each gate characterized for 7x7x5x5 (input slew, output capacitive load,

supply voltage and temperature) combinations• A voltage waveform modeled by 21 uniform point increments• Used the first five coefficients of PCA (76% compression)

0 0.005 0.01 0.015 0.02 0.0250

0.2

0.4

0.6

0.8

1

1.2

Time

Vol

tage

Variational@ T=15, Vdd=1.1Nominal @ T=25, Vdd=1.2

Variational and nominal waveform for an inverter

-0.4 -0.2 0 0.2 0.4 0.60

0.2

0.4

0.6

0.8

1

Time

Vol

tage

First BasisSecond BasisThird Basis

1st, 2nd, and 3rd basis vectors

0 0.02 0.04 0.06 0.08 0.1 0.120

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Histogram of Error

error bin

% fo

r who

le li

brar

y

10%

Histogram of relative error for 43141 waveforms (5 bases)

17

M. Pedram



• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ sensing

18

M. Pedram

Optimization flow: Multi-Corners + Multi-Modes

19

Lib Lib LibLib

Mobileconstraints

Desktopconstraints

Corner 1 Corner 2 Corner 3 Corner 4Un-optimizedDesign

Optimized for corner 1




Optimized Mobile design

Optimized Desktop design

M. Pedram

Multiobjective Optimization

where

fi : Rn→R = objective function

k (≥ 2) = number of (conflicting) objective functions

x = decision vector (of n decision variables xi)

S ⊂ Rn = feasible region formed by constraint functions and

``minimize´´ = minimize the objective functions simultaneously

S

Z

We consider multiobjective optimization problems:

20

M. Pedram

Concepts• A decision maker (DM) is needed to identify a final Pareto optimal

(PO) solution. (S)he has insight into the problem and can express preference relations

• An analyst is responsible for the mathematical side• Solution process = finding a solution• Final solution = feasible PO solution satisfying the DM• Ranges of the PO set: ideal objective vector z* (lower bounds of the

PO set) and approximated nadir objective vector znad (upper bounds of the PO set)

• Utopian objective vector, z**, is strictly better than z*

21

≤ ≤

∈

=

=

*

1

*

max ( )

argmin ( )

nadi i jj k

j jx S

f f x

x f x∈=* min ( )i ix S

f f x

znad

z*

M. Pedram

Concepts cont.• Value function U:Rk→R may represent preferences; other times DM

is expected to be maximizing some value (or utility)• We use the notation u:Rn→R where u(x)=U(f(x))

• If U(z1) > U(z2), then the DM prefers z1 to z2. If U(z1) = U(z2) then z1 and z2 are equally good (indifferent)

• Decision making can be thought of being based on either value maximization or satisficing

• An objective vector containing the aspiration levels ži of the DM is called a reference point, ž ∈Rk

• Problems are usually solved by scalarization, where a real-valued objective function is formed (depending on parameters). Then, single objective optimizers can be used!

22

M. Pedram

A Posteriori Methods: Weighting and ε-Constraint Methods

23

• Generate the PO set (or a part of it); Present it to the DM; Let the DM select one

• Problem

• Problem

M. Pedram

• The DM must specify an aspiration level ži for each objective function

• fi and aspiration level = a goal. Deviations from aspiration levels are minimized (fi(x) – δi = ži where δi may be positive or negative)

• The deviations can be represented as overachievements δi > 0 if fi(x) ≤ ži

• Weighted approach:

• With x and δi (i=1,…,k) as variables

Weights from the DM

A Priori Methods: Goal Programming

24

M. Pedram

• Idea: To classify the objective functions:– Functions to be improved (I<) – Functions whose values can be relaxed (I>)– Acceptable functions (I=)

–Notice that I< U I> U I= = {1,...,k}

• Assumptions– Trade-off information is available in the KT-multipliers

• Aspiration levels for functions in I< given by the DM, upper bounds for function in I> from the KT-multipliers

• Satisficing decision making is emphasized

Interactive Methods: SatisficingTrade-Off Method

25

M. Pedram

Satisficing Trade-Off Method cont.

26

minimizesubj. to x S∈

minimizesubj. to x S∈

• Problem

M. Pedram

Some Results – Transistor Sizing

27

4.03

3.52

3.25

3.95

3.48

3.26

3.88

3.5

3.31

3.87

3.56

3.35

3.23.253.3

3.353.4

3.453.5

3.553.6

3.653.7

3.753.8

3.853.9

3.954

4.05

1.8 2.1 2.4

Dela

y in

ns

Voltage Corners

optimization in 2.4V optimization in 2.1V

multi-objective optimization optimization in 1.8V

M. Pedram

Simulation results for the adder

28

M. Pedram



• Multi-corner or statistical optimization• Augment design for runtime

adaptability• Dynamic control based on in-situ sensing

29

M. Pedram

• A voltage regulator module (VRM) converts and regulates a DC power source– A VRM can typically produce one of a number of distinct voltage

levels based on user-specified voltage identification (VID) code– A VRM must meet various voltage tolerances such as voltage

droops, output ripple and noise, dynamic load limit, etc.• Depending on these tolerances, the response time of the VRM can change

An example of voltage droops

On-Chip Power Delivery Network

M. Pedram

• A PDN configuration is defined by a VRM tree, which is a rooted tree with – A root node representing the main power source (battery)– Internal nodes denoting various VRM’s– Leaf nodes representing functional blocks (FB’s) with their supply voltage

and peak current levelsVRM tree

Voltage

P

VRM3

VRM4

VRM5

VRM6

VRM1

VRM2

[1.5 3.3]

[2.5]

[3.3 5.0]

[3.3]

Current

< 100mA

< 1A

< 500mA

< 1.2A

VRMs-to-FBs mapping

FB1

FB2

FB3

FB4

VRM tree with VRM-to-FB mapping

Voltage Regulator Module Tree

M. Pedram

Energy-Delay Optimal Pipeline Design

• Key idea: Allow the data to pass through a flip flop during some transparency window, instead of on the triggering edge of a clock

• Benefit : Enable opportunistic time borrowing across adjacent pipeline stages– The goal is to provide the timing-critical stages with more time to complete their

computations• Thus, reducing the probability of timing errors

• Determine values of: – operating frequency, supply voltage level, transparency window sizes of

the individual soft-edge FF-sets• Problem Formulations:

– stage delays are modeled as random variables due to process variations– allow timing violations to take place but then implement a mechanism to

detect and fix it

32

D Q D Q D QC1 C2FF0 FF1 FF2

d2d1

borrowed

M. Pedram

Soft Edge Flip Flops

D Q

CLK

SEFF

Clk

ClkD

Tran

spar

ency

Win

dow

DATAClk

D Q

Clk ClkD

Clk Clk

Clk

Clk

ClkD ClkD

ClkD

ClkD

ClkDDelay

-30

-15

0

0

50

100

150

200

40 60 80 100

Setu

p Ti

me

[ps]

Hold

-C2

Q -D

2Q [p

s]

Transparency Window [ps]

DtoQ

CtoQ

Hold

Setup

-30

-20

-10

0

10

20

30

50 70 90 110

Setu

p [p

s]

Transparency Window [ps]

0.80.850.90.951

, ,, , , ,

0

100

200

300

400

500

600

700

800

900

0 50 100 150 200 250

Pow

er C

onsu

mpt

ion

(uW

)

Transparency Window (ps)

Vdd=1.0

Vdd=0.9∙1

33

M. Pedram

Power-Delay Optimal Soft Pipeline

Design of power-delay optimal soft pipeline (OSP)• minimize total power-delay

(energy) of an N-stage pipeline, by finding optimal values of:– global supply voltage– pipeline clock period– transparency windows of

SEFF sets• Solution:

– enumerate all possible values for v,

– for each v, optimally solve a quadratic program, OSP-FV:

34

M. Pedram

SEFF with Built-in Error Detection

• A shadow latch resamples the input data with some phase delay– Need a phase-shifted global clock

signal, PS-Clk• The two sampled data are

compared to one another to detect and flag errors– Error can be handled at the higher

level (e.g., pipeline stall)

PS-Clk

D Q

Main Clk

C1SEFF

Shad

owLa

tch

Error

Main Clock

Phase ShiftedClock

Data is ready

Re-sampling Data

Sampling Data in regular SEFF

D Q

Clk Clk

~Clk

~Clk

~ClkD ~ClkD

ClkD

~PS-Clk

PS-ClkError

Error

ClkD

35

M. Pedram

Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline

36

M. Pedram

Multi-Threshold CMOS• High-Vth power switches are

connected to low-Vth logic gates– Achieves high performance due to

low-Vth logic gates– Reduces leakage power

dramatically due to the series-connected high-Vth power switch

• Typically only a header or a footer sleep transistor is used, not both

• A single sleep transistor may be shared among several logic gates

Gate1

VVSS

Gate2 Gate3

SLEEP

SLEEP

High-Vth

SLEEPVirtual VDD

Virtual VSS

Low-Vth logicgates

37

M. Pedram

Tri-Modal MTCMOS Switch

SLEEP

Sleep Inverter

VVSS

MS1

MS2

MS

Circuit Block

VDD VDD

DROWSYMD1

MD2

GS

SLEEP/DROWSY Multi-Mode Switch

Function 0X Active 10 Sleep 11 Drowsy

38

M. Pedram

Multimodal Power-Gated Pipeline Architecture

• Use different TM (Tri-Modal) switches for pipeline registers and for combinational logic gates so as to enable different power-gating modes for these circuit elements

TM Switch 1

ds

TM Switch 2

ds

VSSVSSVSS

VSSVSSDatainDataout

SleepDrowsy1

SleepDrowsy2

39

M. Pedram

• We designed and implemented a 16×16 pipelined carry save multiplier (CSM) using TSMC 0.18um CMOS– The circuit is divided into two pipeline stages– The 46-bit output of the first stage is latched into the pipeline

registers (46 FF’s)– The first 16 bits out of these 46 bits make the first 16 bits of the

product and are passed to the output directly– The last 30 bits are passed to the second stage making the last

16 bits of the product

HDL (Verilog-XL)

Synthesis (Design Compiler)

Timing Analysis

Place and Route(Encounter)

Netlist Extractionand

HSPICE Simulations40

Experimental Results

M. Pedram

Results, cont’d

41

• Four circuits, which are in the standby modes, are compared :

Circuit Type Leakage (nA)

Ground-Bounce

(mV)

Wakeup/ReadyLatency (ns)

CMOS 63 - -Deep-Sleep 0.10 473 19.32Drowsy 48 143 4.83Data-Retentive 2.85 441 19.32

Circuit TypeStage delay (ns)

Cell area(um2)

Wire length (um)

Wire length (um)

ni=1 ni=2

CMOS 4.54 54720 54402.6 54402.6MTCMOS 4.83 55710 59008.4 56077.2% Increase 6.4 1.8 8.5 3.1

– CMOS– Deep-sleep MTCMOS:

all the cells (including FF’s and logic gates) are in the sleep mode

– Drowsy MTCMOS: all the cells (including FF’s and gates) are in the drowsy mode

– Data-retentive MTCMOS: Logic cells are in sleep mode and pipeline FF’s are in drowsy mode

M. Pedram



• Multi-corner or statistical optimization• Augment design for runtime adaptability• Dynamic control based on in-situ

sensing

42

M. Pedram

Architectures That Tolerate Uncertainty

• Due to variations and computational constraints, there will be significant uncertainty about the real state of a circuit as a result of an optimization decision

• Need solution techniques that can learn from their mistakes and/or successes on similar problem instances encountered in the past to improve the quality of their decision making

• Utilize partially-observable (semi-)Markov decision process model

State estim ator

πa

obState

estim atorπ

ao

b

Observation Belief stateAction/command

Optimization policy

State estimator

43

M. Pedram

Supervisory Mode and Dynamic Control

• To avoid over constraining a system in terms of its power dissipation or temperature, one must adopt online control and supervision solutions that change the system behavior on the fly so as to maximize performance without violating the constraints (or provide the required performance while minimizing energy consumption).

44

M. Pedram

Idleness, State Transitions, and Power Savings

45Source: W-D. Weber

Time scale

ns ms sec min hour day

Components sleeping Machines sleeping

Better active states

More sleepstates

Thread packing Job packing(to cores) (to machines)

M. Pedram

Reality: Non-Energy-Proportional Servers

46M. Pedram, USC

Power dissipation of Intel Xeon 5400 series server at high (2.3GHz) and low (2GHz) settings

12

17

22

27

32

37

42

0 100 200 300 400

pow

er(W

)

total util(%)

Here P1 denotes server power dissipation at 100% utilization, whereas PU is power at utilization level of UAn energy proportional system will have an Energy Inefficiency Factor (EIF) of one at all utilization levels

M. Pedram

GQ

Task

A

ssig

nmen

t

PMU

Core

Task

LQ

Minimum Energy CMP Design with Core Consolidation and DVFS

• Chip Multiprocessor system with M identical cores– Per-core DVFS with N (voltage-frequency) configurations– Local Queue

• Power Management Unit• Global Queue• Task Dispatcher• Tasks

– Expected execution time, τ – Expected instructions per cycle

Objective: Minimize the total Energy consumption of cores in a multicore system

• Constraints: Minimum required throughput, IPSreq• Variables:

– Number of active cores –total number of cores: M– v-f settings of cores –total settings: J– Task distribution– total number of tasks: K

47

M. Pedram

A Hierarchical Solution• Determine the number of ON cores

– ON cores are in C0 when active or C2 (halt or sleep) when idle

• Determine operating frequency of ON cores– A feedback-based control method is adopted for DVFS setting

• This is needed due to inherent uncertainty and variability of task characteristics

• PI Controller: controller adjusts the v-f setting to match the required throughput based on the observed error

• Feedback control loop determines a single optimum frequency for all cores and then the Quantizer would translate fopt to the available DVFS levels

• Find a feasible assignment of tasks to the ON cores

m

Energy

mopt

48

M. Pedram

Simulation Results• Comparison to a relatively energy-efficient baseline PM

– It implements the same power reduction techniques as our method– It utilizes open loop DVFS, and does not support smart wakeup and

shutdown• The figure compares the frequency setting and throughput

– The baseline PM always runs with all cores ON– The closed loop frequency is always lower for the same throughput– In the second 100ms, the baseline chooses lower frequency with four

cores running while our proposed method uses only three cores• The proposed PM gains an overall energy consumption of 17%

0

5000

10000

15000

20000

25000

0

500

1000

1500

2000

2500

1 611 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 101

106

111

116

121

126

131

136

141

146

151

156

161

166

171

176

181

186

191

196

201

206

211

216

221

226

231

236

241

246

251

256

261

266

271

276

281

286

291

296

Thro

ugh

pu

t [M

IPS]

Freq

uen

cy [M

Hz]

Time [ms]

f_proposed

f_base

H_target

H_proposed

H_basem=3

m=4

m=4m=3

m=4

m=4

49

M. Pedram

Changing Landscape: Smart Grid and Dynamic Pricing

50

Source: M. Martonosi

M. Pedram

Electrical Energy Storage Systems

51

M. Pedram

Conclusion

• These are exciting times with many new opportunities and challenges due to planned upgrades to the Power Grid, introduction of renewable sources of energy, smart metering and dynamic energy pricing, people’s awareness of environmental issues, etc.

• A holistic , cross-layer approach to energy efficiency and robustness is needed, which spans– Application efficiency and energy management, micro-architecture and

system design, storage and networks, resource management and scheduling

– Synthesis and physical design– Library characterization and cell design

52

Documents

Robust Design of Power-Efficient VLSI Circuits