1 ISCA 2004 Tutorial Thermal Issues for Temperature-Aware Computer Systems Saturday, June 19 th...


ISCA 2004 Tutorial

Thermal Issues for Temperature-Aware Computer Systems

Saturday, June 19th

8:00am - 5:00pm 

© Mircea Stan, Kevin Skadron, David Brooks, 2002


Presenters:

Kevin Skadron (skadron@cs.virginia.edu), CS Department, University of Virginia

Mircea Stan (mircea@virginia.edu), ECE Department, University of Virginia

David Brooks (dbrooks@eecs.harvard.edu), CS Department, Harvard University

Antonio Gonzalez (antonio@ac.upc.es), UPC-Barcelona, and Intel Barcelona Research Center

Lev Finkelstein (lev.finkelstein@intel.com), Intel Haifa


Overview

1. Motivation (Kevin)
2. Thermal issues (Kevin)
3. Power modeling (David)
4. Thermal management (David)
5. Optimal DTM (Lev)
6. Clustering (Antonio)
7. Power distribution (David)
8. What current chips do (Lev)
9. HotSpot (Kevin)


Motivation

• Power consumption: first-order design constraint
– unconstrained power is a theoretical max
– peak (instantaneous) power limits power delivery (dI/dt)
– sustained power limits thermal design/packaging
– max sustained power: thermal “virus”, same as thermal design power
– average active power and idle power limit mobile battery life, etc.
– Common fallacy: equating instantaneous power with temperature

• Power density is increasing even faster: thermal effects become more problematic.
– Moore’s Law: exponential increase
– Need power/temperature-aware computing!


Power density

From PACT 2000 keynote; source: Intel website

But this curve is flattening


Power-aware figures of merit

• Power (P): battery time (mobile) packaging (high-performance)

• Energy (PD): battery life (mobile) fundamental limits (kT)

• Energy-delay (PD^2): performance and low power

• Energy-delay^2 (PD^3): emphasis on performance

Power-aware ≠ low power
Similar to “old” VLSI complexity (A, AD, AD^2)
None of these are appropriate for thermal

Refs:
R. Gonzalez et al., “Supply and threshold voltage scaling for low power CMOS”, JSSC, Aug. 1997
A. Martin et al., “Design of an Asynchronous MIPS R3000”, ARVLSI’97
J. Ullman, “Computational aspects of VLSI”, CS Press, 1984
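
As a rough illustration of how the figures of merit on this slide can rank the same designs differently, the sketch below evaluates P, PD, PD^2 and PD^3 for two design points. The power and delay numbers are invented for illustration and are not taken from the tutorial.

```python
def figures_of_merit(power_w, delay_s):
    """Return P, PD, PD^2 and PD^3 for one design point."""
    return {
        "P": power_w,
        "PD": power_w * delay_s,
        "PD^2": power_w * delay_s ** 2,
        "PD^3": power_w * delay_s ** 3,
    }


# Hypothetical design points (assumed values, for illustration only).
slow_low_power = figures_of_merit(power_w=10.0, delay_s=2.0)
fast_high_power = figures_of_merit(power_w=30.0, delay_s=1.0)

for name in ("P", "PD", "PD^2", "PD^3"):
    slow = slow_low_power[name]
    fast = fast_high_power[name]
    winner = "slow/low-power" if slow < fast else "fast/high-power"
    print(f"{name}: slow={slow:g} fast={fast:g} -> {winner} design is better")
```

With these assumed numbers the low-power design wins on P and PD, while the fast design wins once delay is weighted more heavily (PD^2, PD^3), which is why the choice of metric matters.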


Cooking-aware computing

Boiling water will come soon


Power and temperature are BAD

• and can be EVIL

Source: Tom’s Hardware Guide, http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html


Thermal issues

Temperature affects:
• Circuit performance
• Circuit power (leakage)
• IC reliability
• IC and system packaging cost
• Environment


Performance and leakage

Temperature affects:

• Transistor threshold and mobility

• Subthreshold leakage, gate leakage

• Ion, Ioff, Igate, delay

• ITRS: 85°C for high-performance, 110°C for embedded!

[Figure: NMOS Ion and Ioff curves]


Temperature-aware circuits

• Robustness constraint: sets Ion/Ioff ratio

• Robustness and reliability: Ion/Igate ratio

Idea: keep ratios constant with T: trade leakage for performance!

Refs: Ghoshal et al., “Refrigeration Technologies…”, ISSCC 2000
Garrett et al., “T3…”, ISCAS 2001


Resulting performance

25% - 30% extra performance (110°C to 0°C)

[Figure legend: regular vs. TAC]


Reliability

The Arrhenius Equation: MTF = A · exp(Ea / (K · T))

MTF: mean time to failure at T
A: empirical constant
Ea: activation energy
K: Boltzmann’s constant
T: absolute temperature

Failure mechanisms:
• Die metallization (corrosion, electromigration, contact spiking)
• Oxide (charge trapping, gate oxide breakdown, hot electrons)
• Device (ionic contamination, second breakdown, surface-charge)
• Die attach (fracture, thermal breakdown, adhesion fatigue)
• Interconnect (wirebond failure, flip-chip joint failure)
• Package (cracking, whisker and dendritic growth, lid seal failure)

Most of the above increase with T (Arrhenius)
Notable exception: hot electrons are worse at low temperatures
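
To make the temperature dependence concrete, the sketch below uses the Arrhenius form MTF = A · exp(Ea / (K · T)) to compare mean time to failure at two temperatures. The empirical constant A cancels in the ratio. The activation energy of 0.7 eV is an assumed placeholder, not a value from the tutorial; real values depend on the failure mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann's constant K in eV/K


def mtf_ratio(t_cool_c, t_hot_c, ea_ev):
    """Return MTF(t_cool) / MTF(t_hot) under the Arrhenius model.

    Temperatures are given in degrees Celsius and converted to absolute
    temperature; the empirical constant A cancels in the ratio.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


# Assumed activation energy (hypothetical, mechanism-dependent).
ratio = mtf_ratio(t_cool_c=85.0, t_hot_c=110.0, ea_ev=0.7)
print(f"MTF at 85 C is about {ratio:.1f}x the MTF at 110 C (assuming Ea = 0.7 eV)")
```

The two temperatures are the ITRS limits quoted earlier in the tutorial; any other pair can be substituted.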


Arrhenius or Erroneous?

“Hot” issue in thermal community: is the Arrhenius equation correct/relevant?

C. Lasance (Philips): “Erroneous” equation
• Claim: what really matters are thermal gradients in space and time, thermal cycling
• Will not solve the dispute here!
• Agreement: thermal issues are key for reliability, whether static or dynamic

Another famous quote: “We have a headache with Arrhenius” (T. Okada, Sony, when asked about reliability prediction methods)


Packaging cost

From Cray (local power generator and refrigeration)…

Source: Gordon Bell, “A Seymour Cray perspective”, http://www.research.microsoft.com/users/gbell/craytalk/


Packaging cost

To today…
• Grid computing: power plants co-located near compute farms
• IBM S/390: refrigeration

Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D


IBM S/390 refrigeration

• Complex and expensive

Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D


IBM S/390 processor packaging

Processor subassembly: complex!
C4: Controlled Collapse Chip Connection (flip-chip)

Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D


Intel Itanium packaging

Complex and expensive (note heatpipe)

Source: H. Xie et al., “Packaging the Itanium Microprocessor”, Electronic Components and Technology Conference 2002


P4 packaging

• Simpler, but still…

Source: Intel web site


Environment

• Environmental Protection Agency (EPA): computers account for 10% of commercial electricity consumption
– This includes peripherals, possibly also manufacturing
– A DOE report suggested this percentage is much lower
– No consensus, but it’s still a lot
• Equivalent power (with only 30% efficiency) for AC
• CFCs used for refrigeration
• Lap burn
• Fan noise


Heat mechanisms

• Conduction
• Convection
• Radiation
• Phase change
• Heat storage


Conduction

• Similar to electrical conduction (e.g. metals are good conductors)
• Heat flows from high energy to low energy
• Microscopic (vibration, adjacent molecules, electron transport)
• No major displacement of molecules
• Needs a material: typically in solids (fluids: distance between molecules)
• Typical example: thermal “slug”, spreader, heatsink

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Conduction

Different materials (not a strong function of temperature)
Si – more variation

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Convection

• Macroscopic (bulk transport, mix of hot and cold, energy storage)
• Needs a material (typically in fluids: liquid, gas)
• Natural vs. forced (gas or liquid)
• Typical example: heatsink (fan), liquid cooling

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Radiation

• Electromagnetic waves (can occur in vacuum)
• Negligible in typical applications
• Sometimes the only mechanism (e.g. in space)

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Surface-to-surface contacts

• Not negligible, heat crowding
• Thermal greases (can “pump-out”)
• Phase Change Films (undergo a transition from solid to semi-solid with the application of heat)

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001


Phase-change

Thermal solutions evolution:
• Natural air cooling
• Forced-air cooling
• Liquid cooling
• Phase change (e.g. heat pipe)
• Refrigeration

Phase change:

a. Solid changing to a liquid: fusion, or melting,

b. Liquid changing to a vapor: evaporation, also boiling,

c. Vapor changing to a liquid: condensation,

d. Liquid changing to a solid: crystallization, or freezing,

e. Solid changing to a vapor: sublimation,

f. Vapor changing to a solid: deposition.


Thermal capacitance

• Example:

ρ(Aluminum) = 2,710 kg/m³

Cp(Aluminum) = 875 J/(kg·°C)
V = t·A = 0.000025 m³

Cbulk = V·Cp·ρ = 59.28 J/°C
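
The arithmetic in this example can be checked with a few lines of Python. The sketch multiplies volume, specific heat and density (written here as `rho`); the numbers are the ones on the slide, while the function and constant names are chosen only for this illustration.

```python
RHO_ALUMINUM = 2710.0   # density, kg/m^3 (value from the slide)
CP_ALUMINUM = 875.0     # specific heat, J/(kg*degC) (value from the slide)
VOLUME = 0.000025       # V = t * A, m^3 (value from the slide)


def bulk_thermal_capacitance(volume_m3, cp_j_per_kg_c, rho_kg_per_m3):
    """Bulk thermal capacitance in J/degC: volume * specific heat * density."""
    return volume_m3 * cp_j_per_kg_c * rho_kg_per_m3


c_bulk = bulk_thermal_capacitance(VOLUME, CP_ALUMINUM, RHO_ALUMINUM)
print(f"Cbulk = {c_bulk:.2f} J/degC")  # prints "Cbulk = 59.28 J/degC"
```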


Refrigeration

“Conventional” vs. thermo-electric (TEC)
• Can get T < T_amb (“negative” Rth!)
TEC: Peltier effect (can use for local cooling)


TEC electro-thermal model


Simplistic steady-state model

All thermal transfer: R = k/A

Power density matters!

Ohm’s law for thermals (steady-state):

V = I · R → T = P · R (T measured relative to T_amb)

T_hot = P · Rth + T_amb

Ways to reduce T_hot:

- reduce P (power-aware)

- reduce Rth (packaging)

- reduce T_amb (Alaska?)

- maybe also take advantage of transients (Cth)
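
The steady-state relation T_hot = P · Rth + T_amb and the first three ways of reducing T_hot can be sketched as follows. All numeric values (50 W, 0.8 °C/W, 45 °C and the variations) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def hot_temperature(power_w, rth_c_per_w, t_amb_c):
    """Steady-state hot-spot temperature: T_hot = P * Rth + T_amb."""
    return power_w * rth_c_per_w + t_amb_c


# Hypothetical baseline: 50 W through 0.8 degC/W with 45 degC ambient.
baseline = hot_temperature(power_w=50.0, rth_c_per_w=0.8, t_amb_c=45.0)

# The three steady-state knobs from the list above (assumed values).
lower_power = hot_temperature(power_w=40.0, rth_c_per_w=0.8, t_amb_c=45.0)
better_package = hot_temperature(power_w=50.0, rth_c_per_w=0.6, t_amb_c=45.0)
cooler_ambient = hot_temperature(power_w=50.0, rth_c_per_w=0.8, t_amb_c=25.0)

for label, value in [
    ("baseline", baseline),
    ("reduce P", lower_power),
    ("reduce Rth", better_package),
    ("reduce T_amb", cooler_ambient),
]:
    print(f"{label:>12}: T_hot = {value:.1f} degC")
```

The fourth option, exploiting transients through Cth, needs a dynamic model and is not captured by this steady-state formula.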


Simplistic dynamic thermal model

Electrical-thermal duality:
• V ↔ temperature (T)
• I ↔ power (P)
• R ↔ thermal resistance (Rth)
• C ↔ thermal capacitance (Cth)
• RC ↔ time constant

KCL:
• differential eq.: I = C · dV/dt + V/R
• difference eq.: ΔV = I/C · Δt − V/(RC) · Δt
• thermal domain: ΔT = P/C · Δt − T/(RC) · Δt (T = T_hot – T_amb)

One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C
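
A minimal sketch of the stepwise computation, assuming a single lumped R and C and constant power: rearranging the KCL equation I = C · dV/dt + V/R in the thermal domain gives dT/dt = P/C − T/(RC), which is stepped here with a simple forward-Euler update. The parameter values are hypothetical, and T denotes the rise above ambient (T_hot − T_amb).

```python
def simulate_temperature_rise(power_w, rth, cth, dt, steps, t_rise0=0.0):
    """Step dT = (P/C - T/(R*C)) * dt and return the temperature-rise trace.

    T is the rise above ambient. Forward Euler is only stable when dt is
    well below the R*C time constant.
    """
    t_rise = t_rise0
    trace = [t_rise]
    for _ in range(steps):
        t_rise += (power_w / cth - t_rise / (rth * cth)) * dt
        trace.append(t_rise)
    return trace


# Hypothetical values: 50 W, 0.8 degC/W, 60 J/degC, so R*C = 48 s.
trace = simulate_temperature_rise(power_w=50.0, rth=0.8, cth=60.0, dt=1.0, steps=1000)
steady_state = 50.0 * 0.8  # P * Rth, the value the trace should approach

print(f"rise after 48 s (one time constant): {trace[48]:.1f} degC")
print(f"rise after 1000 s: {trace[-1]:.2f} degC (steady state {steady_state:.2f})")
```

In practice the power term would change at every step (for example, per-interval power estimates from a simulator), which only requires passing a sequence of power values instead of a constant.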


Combined package model

Source: CRC Press, R. Remsburg Ed. “Thermal Design of Electronic Equipment”, 2001

Steady-state

Tj – junction temperature

Tc – case temperature

Ts – heatsink temperature

Ta – ambient temperature
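
One common way to use these four temperatures is a series chain of thermal resistances from junction to ambient. The sketch below assumes that structure; the resistance names (`r_jc`, `r_cs`, `r_sa`) and all numeric values are hypothetical and are not taken from the slide or its figure.

```python
def package_temperatures(power_w, t_ambient_c, r_jc, r_cs, r_sa):
    """Steady-state temperatures along an assumed series resistance chain.

    r_jc: junction-to-case, r_cs: case-to-heatsink, r_sa: heatsink-to-ambient
    (all in degC/W). The same power flows through every resistance.
    """
    ts = t_ambient_c + power_w * r_sa
    tc = ts + power_w * r_cs
    tj = tc + power_w * r_jc
    return {"Tj": tj, "Tc": tc, "Ts": ts, "Ta": t_ambient_c}


# Hypothetical package: 50 W, 40 degC ambient.
temps = package_temperatures(power_w=50.0, t_ambient_c=40.0, r_jc=0.2, r_cs=0.1, r_sa=0.5)
for name in ("Tj", "Tc", "Ts", "Ta"):
    print(f"{name} = {temps[name]:.1f} degC")
```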


Itanium package model

Example: processor + 4 cache modules

Source: H. Xie et al., “Packaging the Itanium Microprocessor”, Electronic Components and Technology Conference 2002


Thermal issues summary

• Performance, power, reliability
• Architecture-level: conduction only
• Convection: too complicated
• Radiation: can be ignored
• Use compact models for package
• Power density is key
