L3 - System on Chip

8/11/2019 L3 - System on Chip


Outline

The framework: computing platforms in the broad sense

Historical trends towards multi-core through Moore's law

Aggressive voltage scaling, minimum-energy computation and limits

Opportunities to improve energy efficiency/voltage scalability

Beyond-CMOS ultra-low voltage circuits

Prof. Massimo Alioto


The Framework: Computing Platforms in the Broad Sense

Computing Platforms: The Big Picture

Computing/sensing platforms are rapidly expanding*: networks move towards macro and nano scale

nano scale (self-powered nodes): ubiquitous computing/sensing

meso scale (portable/handheld)

macro scale (data centers): cloud computing

* adapted from MuSyC FCRP center


New concepts: Internet of Things, Central Nervous System for the Earth, collective intelligence, ...

New applications: advanced water/energy management, ...

[Figure: nano, meso and macro scale platforms]


Historical Trends towards Multi-Core

CMOS Integrated Circuits

Transistor: Shockley, Brattain, Bardeen (1947, Bell Labs)

Integrated circuit (IC) chip = multiple transistors + interconnects

Jack Kilby (1958) demonstrated the first IC

[Figure: packaged chip mounted on a PCB]


Gordon Moore's Prediction

CMOS technology scaling: each dimension shrinks by 0.7X per generation (X, Y, Z in the previous generation become 0.7X, 0.7Y, 0.7Z in the next generation)

2X more transistors/chip

Prediction in 1965 (not a law)

Moore's law: 1 generation/24 months

exponential growth in transistor count
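The scaling arithmetic above can be checked with a few lines of Python. This is only a sketch: the 0.7X shrink and the 24-month cadence are taken from the slide, while the starting count of one million transistors and the function names are placeholders chosen for illustration.

```python
# Illustrative arithmetic only: the 0.7X linear shrink and the 24-month cadence
# come from the slide; the starting transistor count is a made-up placeholder.

LINEAR_SHRINK = 0.7          # each dimension scales by 0.7X per generation
MONTHS_PER_GENERATION = 24   # cadence quoted on the slide


def density_gain_per_generation(shrink: float = LINEAR_SHRINK) -> float:
    """Transistors per unit area grow as 1 / shrink^2 (area shrinks in X and Y)."""
    return 1.0 / (shrink * shrink)


def transistor_count(start_count: float, years: float) -> float:
    """Project the count after `years`, doubling once per generation."""
    generations = years * 12 / MONTHS_PER_GENERATION
    return start_count * 2.0 ** generations


if __name__ == "__main__":
    print(f"density gain per generation: {density_gain_per_generation():.2f}x")
    for years in (2, 10, 20):
        print(f"after {years:2d} years: {transistor_count(1e6, years):.3g} transistors")
```

A 0.7X shrink in both planar dimensions halves the area per transistor (0.7 × 0.7 ≈ 0.49), which is where the "2X more transistors/chip" per generation comes from.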



colossal investments coordinated by the International Technology Roadmap for Semiconductors (ITRS): process, device, circuits; challenges, performance, consumption, capabilities, ...

As a Result of CMOS Scaling

CMOS scaling trends for microprocessors (macro scale) before 2005: Moore's law + Dennard scaling (voltage reduced at each generation)

exponential growth in # of transistors and performance

[Figure: Intel microprocessors; details not recoverable]


Power vs. energy efficiency

$P_{chip} = \text{throughput} \cdot E_{op} + P_{lkg}$

keep leakage power small enough (10-15%)

$E_{op} \propto V^2$

single-core performance $\propto V$


At a fixed chip power, reduce $E_{op}$ by using a lower $V$, and spend the saved power to improve performance (higher throughput)


Multi-Core: Numerical Example

Post-Dennard scaling: keep $V_{TH}$, $V_{DD}$ constant

performance becomes power limited: use area (more transistors are available, use them to improve efficiency)

use silicon for low power density blocks (cache: 10 W/cm²) that strongly impact total speed, rather than logic (30 W/cm²)

D. Frank, "Power Constrained CMOS Scaling Limits," IBM J. Res. & Dev., vol. 46, no. 2/3, March/May 2002

Example (iso power/technology):

cores   supply        frequency   area   throughput
1       V_DD          f           1      1
2       0.8 V_DD      0.8 f       2      1.6
4       0.63 V_DD     0.63 f      4      1.6² = 2.5
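The numbers in this example can be reproduced with a short Python sketch. It rests on assumptions that the slide does not spell out: frequency proportional to $V_{DD}$, power per core proportional to $V_{DD}^2 \cdot f$, negligible leakage, and a perfectly parallel workload. The function name is hypothetical.

```python
from typing import Tuple


def iso_power_scaling(n_cores: int) -> Tuple[float, float]:
    """Return (voltage/frequency scale factor, relative throughput) at constant power.

    Assumptions (not from the slide): f is proportional to V_DD, power per core
    is proportional to V_DD^2 * f, leakage is ignored, the workload is perfectly
    parallel.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # Total power n * s^3 must stay equal to 1, so s = n^(-1/3).
    scale = n_cores ** (-1.0 / 3.0)
    # Each core runs at s times the original frequency.
    throughput = n_cores * scale
    return scale, throughput


if __name__ == "__main__":
    for n in (1, 2, 4):
        scale, throughput = iso_power_scaling(n)
        print(f"{n} core(s): V_DD and f scaled by {scale:.2f}, throughput {throughput:.2f}")
```

Under these assumptions two cores give a scale factor of about 0.79 and a throughput of about 1.59, which the slide rounds to 0.8 and 1.6; four cores give about 0.63 and 2.52.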


Multi-Core Scaling

Multi-core era will not last long [ISCA2011]: announced catastrophe

"Dark Silicon and the End of Multicore Scaling", due to inadequate energy efficiency

percentage of unusable "dark" silicon is growing fast

[ISCA2011] H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," ISCA, June 2011


new power crisis in 2016 for processors: no reason for scaling

A Broader View of Dark Silicon

At macro scale, "dark" refers to the spatial dimension

At nano scale (self-powered nodes): inadequate energy efficiency leads to dark silicon along the temporal dimension (intermittent available power)

[Figure: available energy vs. time t, alternating "no operation" and "normal operation" intervals]

At meso scale (portable): dark silicon in both the spatial (power constraint 1-2 W) and temporal dimension (limited lifetime @ given functionality)


Green IC Group

Painting silicon green: mission of Green IC group

www.green-ic.org

Aggressive Voltage Scaling, Minimum-Energy Computation and Limits

Voltage Scaling: Dynamic Energy

If dynamic energy per clock cycle dominates:

$E_{dyn} = \alpha_{SW} \cdot C \cdot V_{DD}^2$

affected by switching activity $\alpha_{SW}$, capacitance $C$, voltage $V_{DD}$

reduce $V_{DD}$ as much as possible

energy reduction limited by $V_{DD,min}$ (functional/timing failures)

[Figure: $E_{dyn}$ vs. $V_{DD}$, down to $V_{DD,min}$]


Voltage Scaling: Leakage Energy

If leakage (static) energy per cycle dominates:

$E_{lkg} = V_{DD} \cdot I_{off} \cdot T_{CK}$

affected by supply voltage, leakage current, clock cycle ($\propto$ logic depth × gate delay)

[Figure: pipeline of n stages, each made of a register (Reg) followed by combinational logic (comb 1 ... comb n)]

$V_{DD}$ reduction and trends: $V_{DD}$ decreases linearly, $I_{off}$ is roughly constant, $T_{CK}$ grows exponentially

$E_{lkg}$ exponentially increases at low $V_{DD}$

[Figure: $E_{lkg}$ vs. $V_{DD}$, down to $V_{DD,min}$]


Voltage Scaling: Total Energy

Total energy vs. $V_{DD}$: tradeoff between $E_{dyn}$ and $E_{lkg}$, so a minimum-energy point (MEP) exists

[Figure: $E_{TOT}$, $E_{dyn}$ and $E_{lkg}$ vs. $V_{DD}$, with $V_{DD,min}$ and $V_{DD,opt}$ marked]

MEP determined by optimal balance of $E_{dyn}$ and $E_{lkg}$
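The existence of the MEP can be illustrated numerically. The sketch below is not the model used in the lecture: it assumes a textbook subthreshold current model in which the on/off current ratio grows as exp($V_{DD}$ / ($n \cdot v_t$)), so the leakage energy per cycle, normalized to the switched capacitance, becomes logic depth × $V_{DD}^2$ × exp(−$V_{DD}$ / ($n \cdot v_t$)). All parameter values and names are assumptions picked for illustration.

```python
import math

ALPHA_SW = 0.1       # switching activity (assumed)
LOGIC_DEPTH = 20     # gate delays per clock cycle (assumed)
N_VT = 1.5 * 0.026   # subthreshold factor n times thermal voltage, in volts (assumed)
V_DD_MIN = 0.15      # lowest functional supply, in volts (assumed)
V_DD_MAX = 1.0       # nominal supply, in volts (assumed)


def e_dyn(v_dd: float) -> float:
    """Dynamic energy per cycle, normalized to the total capacitance."""
    return ALPHA_SW * v_dd ** 2


def e_lkg(v_dd: float) -> float:
    """Leakage energy per cycle: V_DD * I_off * T_CK with T_CK ~ depth * C * V_DD / I_on."""
    return LOGIC_DEPTH * v_dd ** 2 * math.exp(-v_dd / N_VT)


def e_tot(v_dd: float) -> float:
    return e_dyn(v_dd) + e_lkg(v_dd)


def minimum_energy_point(step: float = 0.001) -> float:
    """Scan the supply range and return the V_DD with the lowest total energy."""
    n_steps = int(round((V_DD_MAX - V_DD_MIN) / step))
    candidates = [V_DD_MIN + i * step for i in range(n_steps + 1)]
    return min(candidates, key=e_tot)


if __name__ == "__main__":
    v_opt = minimum_energy_point()
    print(f"V_DD,opt = {v_opt:.3f} V, E_tot = {e_tot(v_opt):.4f} (normalized)")
    print(f"E_tot at V_DD,min = {e_tot(V_DD_MIN):.4f}, at nominal = {e_tot(V_DD_MAX):.4f}")
```

With these assumed parameters the minimum falls strictly between $V_{DD,min}$ and the nominal supply: below it the exponentially growing cycle time makes leakage dominate, above it the quadratic dynamic term dominates.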


Importance of Voltage Scaling: Broader View

Minimum-energy operation for better (10X) energy efficiency + circuit/architectural/SW integration

permit performance increase at macro scale

reduces battery size and extends lifetime at meso/nano

Voltage scaling is powerful

intrinsic in Dennard scaling

post-Dennard scaling: aggressive voltage scaling, do it by yourself

as much as possible (give up something), variable workload

deal with related issues


Ultra-Low Voltage (ULV) Operation: Limits and Challenges

Energy reduction comes at a price: performance, leakage energy, resiliency

[Figure: energy, $E_{lkg}$, performance, yield and failure rate vs. $V_{DD}$]


Limits and Challenges

Resiliency degraded at ULV: process/voltage/temperature variations

5-10X more process variations (delay: easily 2X variations), M. Alioto (TCAS-I 2012)

5X higher sensitivity to $V_{DD}$

design margining: a margin for process, voltage and temperature is added on top of the nominal cycle time, R. Krishnamurthy (Micro 2012)

at near threshold, easily 2X margin (in speed binning, many discarded)

margins penalize performance/energy efficiency


Limits and Challenges

Aging (depends on history, workload, voltage, temperature)

Soft errors: higher failure rate at ULV

Degraded functionality at ULV: degraded $I_{on}/I_{off}$ (incomplete switching)

MEM arrays: much less scalability (0.6-0.7 V)

Contributions to $V_{DD,min}$, M. Alioto (TCAS-I 2012):

$V_{DD,min}$ increase due to variations

$V_{DD,min}$ increase due to intrinsic NMOS/PMOS imbalance

$V_{DD,min}$ increase due to residual PUN/PDN imbalance

[Figure: $V_{DD,min}$ contributions in multiples of the thermal voltage $v_t$; values not recoverable]


Opportunities to Improve Energy Efficiency/Voltage Scalability

Near-Threshold ICs

Parallelism compensates speed loss, enhanced by 3D chip stacking

near-threshold computing very promising: efficiency enables data center scalability (D. Blaauw/D. Sylvester)

can enable exascale computing by 2020 (Shekhar Borkar, Intel)


Near-threshold computers will be different

logic/MEM: different scaling (MEM becomes faster)

fewer cache levels, bigger cache

better logic/MEM coupling through 3D integration

More efficient and scalable microarchitectures

deep pipeline: lower leakage energy, since $E_{lkg} = V_{DD} \cdot I_{off} \cdot T_{CK}$

[Figure: pipeline of n stages, each made of a register (Reg) followed by combinational logic (comb 1 ... comb n)]

ultra-low power = high speed: only 17FO4/stage in 1,024-point complex FFT, 4X lower energy (D. Blaauw, D. Sylvester, ISSCC 2011)


Finer-grain voltage domains

currently: cores share same voltage, different frequency

slower cores might operate at lower voltage (lower $E_{op}$, $P_{lkg}$): not possible (share same voltage)

multiple on-chip regulators in sight

[Figure: cores running at different frequencies, each with its own on-chip regulator]

can exploit workload reduction to further reduce $E_{op}$ and $P_{lkg}$
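The benefit of per-core voltage domains can be sketched in a few lines. The model below is an assumption made for illustration, not data from the lecture: normalized dynamic power $V_{DD}^2 \cdot f$, a minimum supply proportional to the required frequency, no leakage and no regulator losses. The per-core frequencies are made-up values.

```python
from typing import Sequence


def dynamic_power(v_dd: float, freq: float) -> float:
    """Normalized dynamic power of one core (unit capacitance and activity)."""
    return v_dd ** 2 * freq


def min_voltage_for(freq: float) -> float:
    """Assumed linear relation: a core at normalized frequency f needs V_DD = f."""
    return freq


def shared_domain_power(freqs: Sequence[float]) -> float:
    """All cores share the supply required by the fastest core."""
    v_dd = min_voltage_for(max(freqs))
    return sum(dynamic_power(v_dd, f) for f in freqs)


def per_core_domain_power(freqs: Sequence[float]) -> float:
    """Each core has its own regulator and runs at its own minimum supply."""
    return sum(dynamic_power(min_voltage_for(f), f) for f in freqs)


if __name__ == "__main__":
    core_freqs = [1.0, 0.8, 0.5, 0.5]  # hypothetical per-core requirements
    shared = shared_domain_power(core_freqs)
    split = per_core_domain_power(core_freqs)
    print(f"shared domain: {shared:.3f}, per-core domains: {split:.3f}")
    print(f"saving: {100 * (1 - split / shared):.0f}%")
```

The saving comes entirely from the slower cores, which in the shared-domain case pay the $V_{DD}^2$ cost of a supply they do not need.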


Enhance Energy Efficiency: Heterogeneity

Exploit heterogeneity (different scaling at ULV)

area is a commodity: give up flexibility for better efficiency

R. Krishnamurthy (Micro 2012)

HW accelerators (media, image, crypto, radio, ...)

same function in different IPs

more extreme: use different replicas (module 1, module 2) with different variations

better energy efficiency, more testing

[Figure: choice between module 1 and module 2 as a function of delay]


Enhance Energy Efficiency

Limit communication energy: exploit locality at different scales

limit off-chip (2-10X intrachip)

limit intra-chip (1-10X computation)

B. Dally (CICC 2012)

restrict data structure and flow (SIMD)

better flip-flops (post-silicon tuning)

better clock domain design

clock slope optimization: 35% better energy efficiency [Alioto TCAS-I 2010]

M. Alioto (ISSCC 2012)


Margin Elimination: Design vs. Testing Time

Uncertainty margin at design time is too expensive: post-silicon (self-)tuning absolutely needed

eliminate margin: optimally allocate cost/design effort at design / testing / boot / run time

decisions related to design time: increase design margin, improve understanding/modeling, more robust design... (complexity, uncertainty)

decisions related to testing / boot / run time (post-silicon): tune at testing time, adapt at boot/run time

circuit designers, architects and testing people need to play in the same field


Margin Elimination: Timing Error Detection

Reduce/eliminate worst-case margin by catching delay faults

correct at run-time, tune to compensate actual variations

run-time testing improves energy efficiency

In-situ monitoring: no margin; invasive, limited tuning

Fault prediction (Tunable Replica Circuit): needs some margin (false positives, mimics only critical path); little invasive, tuning required, low overhead


Margin Elimination: Timing Error Detection

Timing monitoring: some circuit approaches

double sampling: Razor (UMich), DSTB (Intel)

transition detection: Razor II (UMich), TDTB (Intel)

to architecture through OR tree

hold-time/detection window (TD)

metastability in data (Razor)/error path (others)


Margin Elimination: Error Correction

Faults can be corrected at various levels: SW, architecture, microarchitecture, circuit

faster correction: moving from SW to architecture, microarchitecture, circuit

less HW resources: moving from circuit to microarchitecture, architecture, SW

lower energy/performance penalty: moving from SW to architecture, microarchitecture, circuit

[Figure: energy, throughput (IPC) and error rate vs. $V_{DD}$. Lowering $V_{DD}$ from the margined $V_{DD}$ of a traditional design to the Point of First Failure (PoFF) gives an energy reduction through margin elimination. Below the PoFF the error rate increases, which adds the energy overhead of correction and degrades throughput; the minimum energy under error detection/correction is reached at an energy-optimum $V_{DD}$ below the PoFF.]
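The shape of the energy curve around the PoFF can be reproduced with a toy model. Everything numeric below is an assumption made for illustration (normalized voltages, an error rate that grows exponentially below the PoFF, a fixed number of extra operations per corrected error); none of it is measured data from the lecture.

```python
import math

V_MARGINED = 1.00     # supply of a traditional margined design (normalized, assumed)
V_POFF = 0.85         # Point of First Failure (assumed)
RATE_AT_POFF = 1e-6   # error rate right below the PoFF (assumed)
RATE_SLOPE = 0.01     # volts per e-fold increase of the error rate (assumed)
PENALTY_OPS = 10      # extra operations spent per corrected error (assumed)


def error_rate(v_dd: float) -> float:
    """Errors per operation: zero at or above the PoFF, exponential growth below it."""
    if v_dd >= V_POFF:
        return 0.0
    return min(1.0, RATE_AT_POFF * math.exp((V_POFF - v_dd) / RATE_SLOPE))


def energy_per_useful_op(v_dd: float) -> float:
    """Quadratic energy per operation plus the overhead of re-doing failed work."""
    return v_dd ** 2 * (1.0 + PENALTY_OPS * error_rate(v_dd))


def energy_optimum_vdd() -> float:
    """Scan 0.600 V to 1.000 V in 5 mV steps and return the lowest-energy supply."""
    candidates = [mv / 1000.0 for mv in range(600, 1001, 5)]
    return min(candidates, key=energy_per_useful_op)


if __name__ == "__main__":
    v_opt = energy_optimum_vdd()
    for label, v in (("margined", V_MARGINED), ("PoFF", V_POFF), ("optimum", v_opt)):
        print(f"{label:9s} V_DD = {v:.3f}: energy = {energy_per_useful_op(v):.3f}")
```

In this toy model the optimum sits below the PoFF: a small error rate is cheaper to correct than the margin it replaces, until the exponential growth of errors takes over.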


Margin Elimination: Error Correction

Existing approaches

circuit: clock gating (UMich), clock stretching (Georgia Tech); error propagation within a clock cycle (very hard)

microarchitecture: counterflow pipelining (UMich), micro-rollback (UMich), Bubble Razor (UMich); interferes with microarchitecture/cycle-based timing

architecture: instruction re-execution (Intel), simple, large penalty; checkpoint-restart (Wisc), simple, very large penalty


The Next Step: Sub-Cycle Detection/Correction

Existing approaches are cycle-based (from J. Crop et al., JLPEA, 2011)

correction interferes with microarchitecture (design effort)

errors affect timing at boundary: difficult SoC integration

large energy penalty in high error rate regime (future)

Our vision: sub-cycle detection/correction

errors detected/corrected in the same cycle or, at least, errors do not have to propagate to the boundary

so that errors are confined and determine low energy penalty


Approximate Computing as Extreme Scaling

Some apps do not need to have perfect computation: aggressively push voltage and tolerate errors

approximate computing (voltage overscaling by N. Shanbhag, K. Roy)

ex.: multimedia (occasionally wrong pixels/samples)

errors not corrected on the fly

rather, avg error rate kept within bound (slow correction loop)

degradation of signal quality can be dynamically adjusted (application level)
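A slow correction loop of the kind mentioned above can be sketched as a simple hysteresis controller. This is one possible reading, not the scheme of the cited authors: the error-rate model, the thresholds and the step size are all hypothetical values chosen so that the behaviour is easy to follow.

```python
import math

ERROR_BOUND = 1e-3       # maximum tolerated average error rate (assumed)
LOWER_THRESHOLD = 4e-4   # below this the loop tries a lower supply (assumed)
V_STEP = 0.01            # supply step per loop iteration, in volts (assumed)


def average_error_rate(v_dd: float) -> float:
    """Hypothetical average error rate of a voltage-overscaled block at supply v_dd."""
    return min(1.0, 1e-2 * math.exp((0.70 - v_dd) / 0.02))


def slow_correction_loop(v_start: float = 1.0, iterations: int = 100) -> float:
    """Lower V_DD while errors are rare, raise it when the bound is exceeded."""
    v_dd = v_start
    for _ in range(iterations):
        rate = average_error_rate(v_dd)
        if rate > ERROR_BOUND:
            v_dd = round(v_dd + V_STEP, 3)   # too many errors: back off
        elif rate < LOWER_THRESHOLD:
            v_dd = round(v_dd - V_STEP, 3)   # comfortable slack: save energy
        # otherwise hold the current supply
    return v_dd


if __name__ == "__main__":
    v_final = slow_correction_loop()
    print(f"settled at V_DD = {v_final:.2f} V, "
          f"average error rate = {average_error_rate(v_final):.2e}")
```

Individual errors are never corrected here; the loop only keeps their average rate inside the window between the two thresholds, which is what allows it to run slowly and cheaply.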



Our Approach: User Experience-Centric Design

Voltage/energy reduction in portable multimedia for a given quality of user experience

tight link between circuit and final user: errors are acceptable

metrics for quality of user experience (PSNR): close circuit design loop at application level

minimize energy for given quality

energy scalability: reduce energy if lower quality is accepted (dynamic scaling)

[Figure: two decoded frames, PSNR = 24 dB and PSNR = 36 dB]


Limits of recent work on energy scalability (SRAM)

[Wolf2009], [Kurdahi2008]: aggressive $V_{DD}$ scaling to reduce energy at the cost of higher BER

very limited voltage/energy scalability: BER depends exponentially on $V_{DD}$, so it abruptly increases beyond the targeted quality

[Figure: BER (or PSNR) and energy vs. $V_{DD}$, with the targeted quality marked]

same limitation in mixed 6T/8T SRAM (Roy 2011)

near threshold, 6T array almost always fails, 8T almost never fails

not really scalable either


Our approach

errors have different impact depending on where they occur

optimal energy allocation: protect (= spend energy) only important bits to have graceful degradation (various knobs)

when limiting precision, use unused bits to improve resiliency

can push more on $V_{DD}$ to reduce energy at same quality

currently, [..]-nm chip under test
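Why errors matter differently depending on where they occur can be seen by flipping one bit position at a time in 8-bit pixels and measuring the PSNR. The sketch below uses a synthetic frame and a deterministic error pattern (one flipped pixel out of every 100); both are assumptions made for illustration and are unrelated to the test chip.

```python
import math
from typing import List, Sequence


def psnr(reference: Sequence[int], test: Sequence[int]) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit samples."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(255.0 ** 2 / mse)


def flip_bit(pixels: Sequence[int], bit: int, every: int = 100) -> List[int]:
    """Flip the given bit position in one pixel out of `every` (deterministic pattern)."""
    mask = 1 << bit
    return [p ^ mask if i % every == 0 else p for i, p in enumerate(pixels)]


if __name__ == "__main__":
    # Synthetic 8-bit frame, a stand-in for real video data.
    frame = [(37 * i) % 256 for i in range(10_000)]
    for bit in (0, 4, 7):
        print(f"errors in bit {bit}: PSNR = {psnr(frame, flip_bit(frame, bit)):.1f} dB")
```

At the same error rate, errors confined to the least significant bit leave the PSNR around 68 dB, while errors in the most significant bit bring it down to about 26 dB. Spending energy to protect only the upper bits therefore buys most of the quality.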


Results in 28-nm 32-kb SRAM, YUV format (QCIF 144x176)

Akiyo video, frame #30

PSNR w.r.t. voltage scaling

41% better PSNR (dB) at same energy

[Figure: frame #30 in three panels: A, B, Original]


Other Opportunities

Enable burst very high-speed computation: just violate reliability constraint

temporarily exceed Thermal Design Power

leverage thermal cap for DVFS: Turbo Boost 2.0 [Intel, Rotem et al., HOTCHIPS 2011]

enhance thermal cap via phase change materials: Computational Sprinting [Raghavan, HPCA 2012]



Our Vision of Distributed Power Management

Globally green systems: energy-efficient, widely energy-scalable and externally tunable components

need for communication (energy state, knob tuning)

global policies based on information on energy state

[Figure: a MODULE with inputs, outputs, processing and registers (REG), connected to a TRADITIONAL COMMUNICATION CHANNEL (ex.: bus, NoC, crossbar...) and to an ENERGY MANAGEMENT CHANNEL, added to enable energy scalability and dynamic tradeoff with other assets. Over the energy management channel the module receives instantaneous requirements (ex.: throughput, arithmetic precision...) and reports energy-related parameters (ex.: timing slack, bit error rate...) measured by sensors; it self-adjusts internal settings through knobs to minimize energy.]


Our Vision of Distributed Power Management

keep it simple (integration), yet maintain global view: hierarchical structure

[Figure: hierarchy of power managers, with higher levels in the hierarchy at the top]

enables remote power management (global view and intelligence kept out of nano-scale nodes)

move computation where more efficient (computation vs. communication, locality, heterogeneity)


Beyond-CMOS Ultra-Low Voltage Circuits

Tunnel-FETs: a Very Promising Alternative

Main limit to voltage scaling of CMOS transistor: $V_{TH}$ can be reduced only if subthreshold slope is lowered at given leakage

use new devices with lower subthreshold slope

Tunnel FETs: very promising (ITRS: after 2020)

[Figure: physical structure, p+ / i / n+; label: metal]


Tunnel-FETs: Robustness Comparison

Comparison with CMOS bulk (FinFET) / SOI

fair: all optimized for ULV, same targets (leakage)

Noise margin degradation at ULV

[Figure: noise margin plot (label: linear); not recoverable]


Tunnel-FETs: Energy Comparison

FO4 inverter chain (10% activity, 16 slices)

min. energy vs. logic depth

max. TFET advantage w.r.t. SOI: 35% @ 60FO4

[Figure: minimum energy vs. logic depth; plot not recoverable]


Tunnel-FETs: Energy Comparison

Impact of transistor stacking

at ULV, leakage reduction in 2-4 stacked TFETs is 5-8X better than SOI, 3-6X better than bulk

at ULV, $I_{on}$ reduction in 2-4 stacked TFETs is u...

TFET cells with larger fan-in provide more benefits: faster, lower leakage, lower min. energy

TFET standard cell libraries must include higher fan-in cells

Example: zero-detector with 4-input gates: min. energy improved by 1.79X (1.84X) w.r.t. SOI (bulk)



Tunnel-FETs: SRAM cell

System voltage scalability limited by SRAM cell: small margins, sensitive to variations

8T cell: about same area (33 × 13.4 F²)

TFET SNM scales better

TFET / SOI / bulk

$V_{DD}$ > 140 mV: 30% $V_{DD}$, 35% $V_{DD}$, 30% $V_{DD}$

[Figure: SNM plot for TFET, SOI and bulk; not recoverable]


Conclusions

Future computing platforms (macro, meso, nano)

Green: energy efficiency is key in any component

Ultra-low voltage is really challenging: speed, leakage, resiliency (design margin)

Opportunities to overcome challenges

margin reduction

heterogeneity

fine-grain/independent power domains

coordinated architecture/circuit design

use better devices



Speaker's Contacts

Massimo Alioto, Ph.D.

E-mail: [email protected]

ECE Department, National University of Singapore (NUS)

4 Engineering Drive 3, Singapore 117576

