
Exponential Challenges, Exponential Rewards—

The Future of Moore’s Law

Based on lecture of Shekhar Borkar

Intel Fellow

Circuit Research, Intel Labs

2

ISSCC 2003—Gordon Moore said…

“No exponential is forever… But we can delay Forever”

3

Goal: 1 TIPS by 2010

[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year (1970 to 2010), marking the 8086, 286, 386, 486, Pentium® Architecture, Pentium® Pro Architecture, and Pentium® 4 Architecture.]

How do you get there?

4

Transistors Scaling

Will high K happen? Would you count on it?

5

Technology Scaling

[Figure: MOSFET cross-sections labeling GATE, SOURCE, DRAIN, BODY, Tox, Xj, D, and Leff.]

Dimensions scale down by 30%: doubles transistor density

Oxide thickness scales down: faster transistor, higher performance

Vdd & Vt scaling: lower active power

Technology has scaled well, will it in the future?
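The density claim follows from simple geometry. Here is a minimal sketch, assuming an ideal shrink in which both linear dimensions of a transistor scale by the same factor; the function name is hypothetical and not from the lecture.

```python
def density_gain(linear_shrink: float) -> float:
    """Return the transistor density multiplier for a given linear shrink.

    linear_shrink is the fractional reduction in each dimension,
    e.g. 0.30 for the 30% per generation quoted on the slide.
    """
    scale = 1.0 - linear_shrink   # each dimension becomes 0.7x
    area = scale * scale          # area per transistor becomes 0.49x
    return 1.0 / area             # transistors per unit area


if __name__ == "__main__":
    print(f"{density_gain(0.30):.2f}x density per generation")
```

A 30% shrink in each dimension leaves about half the area per transistor, which is where the "doubles transistor density" figure comes from.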

6

Gate Oxide is Near Limit

[Figure: micrograph of a 130 nm transistor with a 70 nm gate, labeling Si3N4 and CoSi2, next to MOSFET cross-sections labeling GATE, SOURCE, DRAIN, BODY, and Tox.]

Will high K happen? Would you count on it?

7

3D-Gate Transistor

8

Transistor Integration Capacity

[Figure: transistors (millions, log scale 0.001 to 1000) versus technology (10 µm down to 0.045 µm), with the 1 billion level marked.]

On track for 1 billion transistor integration capacity

9

35 Years of Microprocessor Trend

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

10

Transistor Integration Capacity

11

Transistor Integration Capacity

12

Transistor Integration Capacity

13

Transistor Integration Capacity

14

Exponential Challenge #1

15

Is Transistor a Good Switch?

[Figure: an ideal switch (I = ∞ when on, I = 0 when off) compared with a real transistor (I = 1 mA/µm when on, I ≠ 0 when off: sub-threshold leakage).]

16

Sub-threshold Leakage

Sub-threshold leakage increases exponentially

[Figure: Ioff (nA/µm, log scale 1 to 10,000) versus temperature (30 to 130 °C), with curves from 0.25 µm to 45 nm.]

Assume: 0.25 µm, Ioff = 1 nA/µm; 5X increase each generation at 30 °C
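The stated assumption can be worked through numerically. The sketch below compounds the 5X step from the 1 nA/µm starting point at 0.25 µm. The generation list is taken from the axis of the Power Crisis chart later in the deck; the function name is hypothetical.

```python
# Technology generations as listed on the Power Crisis chart.
GENERATIONS = ["0.25 µm", "0.18 µm", "0.13 µm", "90 nm", "65 nm", "45 nm"]


def ioff_by_generation(start_na_per_um: float = 1.0, growth: float = 5.0) -> dict:
    """Compound the per-generation leakage growth at 30 °C.

    start_na_per_um: Ioff at the first generation (1 nA/µm on the slide).
    growth: multiplier applied at each new generation (5X on the slide).
    """
    return {
        name: start_na_per_um * growth ** step
        for step, name in enumerate(GENERATIONS)
    }


if __name__ == "__main__":
    for name, ioff in ioff_by_generation().items():
        print(f"{name:>8}: {ioff:7.0f} nA/µm")
```

Five generations of 5X growth multiply the leakage by 5^5 = 3125, which is why the chart needs a logarithmic axis.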

17

Leakage Power

[Figure: leakage power (% of total, 0% to 50%) versus technology (1.5 µm down to 0.045 µm), annotated "Must stop at 50%".]

Leakage power limits Vt scaling

A. Grove, IEDM 2002

18

The Power Crisis

[Figure: active and leakage power (W, 0 to 1200) for a 15 mm die, by technology generation from 0.25 µm to 45 nm.]

19

How Power Should Have Scaled

A. Danowitz et al. CPU DB: Recording Microprocessor History. ACM Queue, Processors, vol. 10, issue 4, pp. 1-18, 2012

20

Exponential Challenge #4

21

Impact on Path Delays

Path delay variability due to technological variations
Impacts individual circuit performance and power

[Figure: probability distribution of path delay, spread by variations in Vdd, Vt, and Temp.]

Optimize each circuit for performance and power

22

Impact on Path Delays

How many silicon atoms (111 pm) fit along a transistor channel (20 nm)? Is the 3D transistor a solution?
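The atom-count question is a one-line division. The sketch below takes the slide's two numbers at face value and assumes each atom occupies 111 pm along the channel; a real silicon lattice has a different spacing, so treat the result as an order-of-magnitude estimate. The names are hypothetical.

```python
ATOM_SIZE_M = 111e-12       # 111 pm, the per-atom size quoted on the slide
CHANNEL_LENGTH_M = 20e-9    # 20 nm channel length quoted on the slide


def atoms_along_channel(channel_m: float = CHANNEL_LENGTH_M,
                        atom_m: float = ATOM_SIZE_M) -> float:
    """Estimate how many atoms fit end to end along the channel."""
    return channel_m / atom_m


if __name__ == "__main__":
    print(f"about {atoms_along_channel():.0f} atoms")
```

With only a couple of hundred atoms along the channel, a variation of a few atoms is already a percent-level change in channel length, which is one way to see why parameter variation grows with scaling.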

23

Shift in Design Paradigm

From deterministic design to probabilistic and statistical design
– A path delay estimate is probabilistic (not deterministic)

Multi-variable design optimization for
– Parameter variations
– Active and leakage power
– Performance
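One way to picture a probabilistic path delay estimate is a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not the lecture's method: each gate delay is drawn independently from a Gaussian around a nominal value, and the gate count, sigma, and function names are all hypothetical.

```python
import random
import statistics


def sample_path_delays(n_gates=20, nominal=1.0, sigma=0.1,
                       trials=10_000, seed=0):
    """Draw `trials` path delays, each the sum of n_gates random gate delays.

    Assumes independent Gaussian gate delays; real variation in Vdd, Vt,
    and temperature is correlated, so this is only a toy model.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    return [
        sum(rng.gauss(nominal, sigma) for _ in range(n_gates))
        for _ in range(trials)
    ]


def fraction_slower_than(delays, limit):
    """Fraction of sampled paths whose delay exceeds the given limit."""
    return sum(d > limit for d in delays) / len(delays)


if __name__ == "__main__":
    delays = sample_path_delays()
    print("mean delay:", round(statistics.mean(delays), 2))
    print("std dev:   ", round(statistics.stdev(delays), 2))
    print("paths slower than 21.0:", fraction_slower_than(delays, 21.0))
```

A deterministic flow would report a single number for this path; the statistical view reports a distribution and asks what fraction of paths miss a given timing limit.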

24

Exponential Challenge #6

25

Exponential Costs

[Figure, four panels:
Litho Cost: litho tool cost ($K, log scale $10 to $100,000) versus year, 1960 to 2010. Source: G. Moore, ISSCC 03.
FAB Cost: fab cost ($M, log scale $1 to $10,000) versus year, 1960 to 2010. Source: www.icknowledge.com.
$ per Transistor: $/transistor (log scale 1E-06 to 1E-01) versus year, 1965 to 2005.
$ per MIPS: $/MIPS (log scale 1E-02 to 1E+04) versus year, 1965 to 2005.]

26

Some Implications

Tox scaling will slow down, may stop?

Vdd scaling will slow down, may stop?

Vt scaling will slow down, may stop?

Approaching constant Vdd scaling

Energy/logic op will not scale

[Figure: Vdd (volts, log scale 0.1 to 100) versus technology (10 µm down to 0.045 µm), flattening near ~1 volt.]

[Figure: energy per logic operation (normalized, log scale 1E-08 to 1E+00) versus technology (10 µm down to 0.045 µm), annotated "Slow Down?".]
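The link between constant Vdd and stalled energy scaling can be sketched with the usual CMOS switching-energy relation, E ∝ C·Vdd². The 0.7X capacitance factor per generation is an assumption chosen to match the 30% dimension shrink quoted earlier, and the function name is hypothetical.

```python
def energy_scale(c_scale: float, vdd_scale: float) -> float:
    """Per-generation multiplier for switching energy, using E ~ C * Vdd^2."""
    return c_scale * vdd_scale ** 2


if __name__ == "__main__":
    # Classic scaling: capacitance and supply voltage both shrink by 0.7X.
    print("Vdd scales:  ", round(energy_scale(0.7, 0.7), 3))
    # Constant-Vdd scaling: only the capacitance shrinks.
    print("Vdd constant:", round(energy_scale(0.7, 1.0), 3))
```

Under these assumptions, energy per operation falls to about a third per generation while Vdd scales, but only to 0.7X once Vdd is pinned near 1 volt, so the energy curve flattens along with the Vdd curve.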

27

The Terascale Dilemma

Many billion transistor integration capacity will be available
– But could be unusable due to power

Logic transistor growth will slow down

Transistor performance will be limited

Solutions
Low power design techniques
Improve design efficiency

28

Exponential Challenge #5

29

Platform Requirements

[Figure: system volume (cubic inches, 0 to 3000) for PC tower, mini tower, µ-tower, slim line, and small PC form factors.]

Shrinking volume

Quieter

Yet, High Performance

[Figure: thermal budget (°C/W, 0 to 1.5), projected heat dissipation volume (heat-sink volume, in³, 0 to 75), and projected air flow rate (CFM) versus power (0 to 200 W), with the Pentium® III and Pentium® 4 marked.]

Thermal budget decreasing

Higher heat sink volume

Higher air flow rate

30

Active Power Reduction

Multiple Vdd

[Figure: a fast logic path on a high supply voltage between slow paths on a low supply voltage.]

Throughput oriented design

[Figure: one logic block at Vdd compared with two parallel logic blocks at Vdd/2.]

One logic block at Vdd: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Pwr Den = 1

Two logic blocks at Vdd/2: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Pwr Den = 0.125
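The numbers on the throughput-oriented slide can be reproduced with a small model. This sketch assumes active power per block is proportional to Freq · Vdd² with all constants normalized to 1, and that area and throughput add across parallel blocks; the class and property names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Design:
    blocks: int    # number of identical logic blocks working in parallel
    freq: float    # normalized clock frequency of each block
    vdd: float     # normalized supply voltage of each block

    @property
    def throughput(self) -> float:
        return self.blocks * self.freq

    @property
    def power(self) -> float:
        # Active power per block ~ f * Vdd^2 (capacitance normalized to 1).
        return self.blocks * self.freq * self.vdd ** 2

    @property
    def area(self) -> float:
        return float(self.blocks)

    @property
    def power_density(self) -> float:
        return self.power / self.area


if __name__ == "__main__":
    for d in (Design(1, 1.0, 1.0), Design(2, 0.5, 0.5)):
        print(d, d.throughput, d.power, d.area, d.power_density)
```

Two half-speed blocks at half the voltage keep throughput at 1 while power drops to 0.25 and power density to 0.125, at the cost of twice the area, matching the values on the slide.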

31

Design & µArch Efficiency

[Figure: growth (X, 0 to 4) in die area, performance, and power from the previous µArch for S-Scalar, Dynamic, and Deep Pipe-line designs, on the same process technology.]

[Figure: reduction in MIPS/Watt (0% to 40%) for S-Scalar, Dynamic, and Deep Pipe-line designs, on the same process technology.]

Energy efficiency drops ~20%

Employ efficient design & µArchitectures

Improve µArch Efficiency

[Figure: execution timelines. A single thread (ST) stalls while it waits for memory; with multi-threading, threads MT1, MT2, and MT3 run during each other's memory waits.]

Thermals & Power Delivery designed for full HW utilization

Multi-threading improves performance without impacting thermals & power delivery

Computer Architecture: A Quantitative Approach (Hennessy; Patterson, 2011)
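The multi-threading argument can be illustrated with an idealized utilization model. The sketch assumes every thread alternates a fixed compute burst with a fixed memory wait and that switching threads is free; the cycle counts and the function name are hypothetical, not taken from the lecture.

```python
def core_utilization(threads: int, compute_cycles: int, wait_cycles: int) -> float:
    """Fraction of time the core does useful work.

    Each thread computes for compute_cycles, then waits wait_cycles for
    memory. Other threads fill the wait, up to a fully busy core.
    """
    per_thread = compute_cycles / (compute_cycles + wait_cycles)
    return min(1.0, threads * per_thread)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "thread(s):", core_utilization(n, 25, 75))
```

In this model a single thread that waits on memory three quarters of the time keeps the core only 25% busy, and each extra thread adds another 25% until the hardware is fully used. Since thermals and power delivery are already sized for full utilization, the extra work fits inside the existing envelope.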

33

Increase on-die Memory

[Figure: cache as % of full chip area (0% to 100%) versus technology (0.7 µm down to 0.10 µm) for the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium® 4, with a "?" for the next generation.]

Large on die memory provides:

1. Increased Data Bandwidth & Reduced Latency

2. Hence, higher performance for much lower power

[Figure: power density (Watts/cm², log scale 1 to 100) of logic and memory versus technology (0.25 µm down to 0.1 µm).]

34

Chip Multi-Processing

Keynote presentation (L. Benini, RSP 2010).

35

Chip Multi-Processing

[Figure: relative performance (1 to 3.5) versus die area and power (1 to 4) for CMP and ST, next to a floorplan with four cores (C1 to C4) and a cache.]

• Multi-core, each core Multi-threaded
• Shared cache and front side bus
• Each core has different Vdd & Freq
• Spreading hot spots
• Lower junction temperature

36

Example (Itanium Tukwila)

37

Example (Itanium Tukwila)

30 MBytes cache

130 Watts

38

Example (Itanium Tukwila)

39

What Will the Cores Look Like?

40

What Will the Cores Look Like?

41

What Will the Cores Look Like?

42

What Will the Cores Look Like?

• Clocks run with the same frequency but unknown phases

43

What Will the Cores Look Like?

44

What Will the Cores Look Like?

• Intelligent workload redistribution

• Improvement of energy efficiency

• Multiple functionalities

45

What Will the Cores Look Like?

• Several interconnection possibilities
• Mesh
• Ring

46

Tera-Scale

RMS - Recognition, Mining and Synthesis

47

Tera-Scale

48

Tera-Scale

49

Tera-Scale

50

The Exponential Reward

[Figure: MIPS (log scale, 0.01 to 1,000,000) versus year (1970 to 2010), divided into three eras:
Era of Pipelined Architecture: 8086, 286, 386, 486
Era of Instruction Level Parallelism: Super Scalar; Speculative, OOO
Era of Thread & Processor Level Parallelism: Multi Threaded; Multi-Threaded, Multi-Core; Special Purpose HW]

51

Summary—Delaying Forever

Terascale transistor integration capacity will be available - Power and Energy are the barriers

Variations will be even more prominent - shift from Deterministic to Probabilistic design

Improve design efficiency

Exploit integration capacity to deliver performance in power/cost envelope

52

Exercises

1. Discuss one problem associated with device integration.

2. Comment on the statement: “Shrinking transistor size shifts the paradigm for evaluating energy consumption and execution time from deterministic to probabilistic.”

3. Why is static energy consumption so problematic for future technologies?

4. Why is voltage reduction one of the main factors to address in order to reduce energy consumption?

5. How can a system with multiple supply voltages contribute to reducing energy consumption? What is the effect on execution time?

53

Exercises

6. Draw an illustration showing how a multi-threaded program can make better use of a system's resources, reducing the memory communication bottleneck.

7. Why does the share of on-chip memory in an integrated circuit exceed 50% in current processors?

8. Given the limits of scaling, what can be done to keep increasing machine performance?

9. What are the trends in computation (cores), communication infrastructure, and storage for upcoming processors?