Multiple Clock and Voltage Domains for Chip Multi Processors
December 2009

Efraim Rotem, Intel Corporation, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Microsoft R&D, Israel
Uri Weiser, Technion, Israel



Compute Performance matters

• Fueled by a combination of process and arch
• We would like to keep on providing performance: power is the #1 limiter
• Both process technology and ILP slow down, leading to multi core architectures
• An order of magnitude more power efficient, but deep in the power wall

[Figure: compute performance by year, 1978 to 2006, log scale from 1 to 10,000, with power annotations at 1W, 10W and 100W. Source: Dave Patterson]

Work Overview - scope

• How to best architect and manage clock and voltage domains of a CMP to max performance under power constraints
• 16 core power constrained CMP
• 1 thru 16 voltage regulators (VR)
  – Either on chip or off chip VR
• 1 thru 16 clock domains
  – FIFO buffers increase latency
• Paper contributions:
  – Power delivery constrains DVFS
    • Multi-voltage domains not so easy
  – Methodology to evaluate CMP workloads
  – Clustered voltage and clock domains

[Figure: CPU block diagram. Labels: PE #1 to PE #n, PMU, Cache, FIFO Buffer, Core #1 to Core #n, L2 Cache, I/O and Memory, Interconnect, DC/DC VR]

Operation point and constraints

• Process technology voltages
  – Voltage range Vmin – Vmax
  – Frequency range fmin – 2fmin
  – Nominal working point Vmin, fmin
• Lower bound on quality of service
  – Frequency DFS down to ½ fmin
• Total power is a constraint
  – Not exceed nominal power
• Power delivery has been added as a constraint
• Most constraining parameter wins
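A minimal sketch of "most constraining parameter wins", assuming each constraint can be expressed as the highest relative frequency it permits. The function name, the constraint names and the numbers are illustrative and not taken from the paper.

```python
def allowed_frequency(limits):
    """Return (binding_constraint, frequency) for one domain.

    `limits` maps a constraint name to the highest relative frequency
    that constraint permits (1.0 = fmin). The tightest limit wins.
    """
    if not limits:
        raise ValueError("at least one constraint is required")
    return min(limits.items(), key=lambda item: item[1])


# Hypothetical limits, in units of fmin.
limits = {
    "max_frequency": 2.0,    # frequency range tops out at 2*fmin
    "total_power": 1.4,      # illustrative value
    "power_delivery": 1.2,   # illustrative value
}
binding, freq = allowed_frequency(limits)
```

With these illustrative numbers the power delivery limit is the binding one, so the domain runs at 1.2 fmin even though the other constraints would allow more.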

Why is VR a constraint? Simplified example

• Given a 16 core 100A shared power delivery
  – Tying all cores together allows sharing current among cores
  – Allow one core to consume all the current
• Assume we can split the same VR into 16
  – Allow each core a fixed 100A/16
  – Sharing is not possible
  – Keeping capability requires 1,600A!

[Figure: 16 cores each supplied with I/16, versus a single core drawing the full current I]
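The arithmetic behind the simplified example, written out as a short sketch. The variable names are illustrative; the 16 cores and 100A come from the slide.

```python
CORES = 16
SHARED_VR_AMPS = 100.0  # one regulator shared by all cores

# Shared delivery: a single busy core may draw the whole budget.
peak_per_core_shared = SHARED_VR_AMPS

# Same regulator split into 16 fixed slices: no sharing between cores.
peak_per_core_split = SHARED_VR_AMPS / CORES

# To give every core the original 100A peak with private regulators,
# each of the 16 regulators must be sized for 100A.
total_amps_to_keep_peak = SHARED_VR_AMPS * CORES
```

Here `peak_per_core_split` is 6.25A and `total_amps_to_keep_peak` is 1,600A, the figure quoted on the slide.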

Power delivery is constrained

• Need power delivery headroom for performance
• Replacing 1 VR by 16 individual VRs:
  – Does not allow current sharing between cores
  – Results in degraded power delivery
• New technologies:
  – Need less area / volume, BUT
  – Still deliver limited current
• More details in the paper

Modeling methodology
Workload construction

Hybrid model

• Offline characterization of a real CPU:
  – Instrumented Intel® Core™-2 Duo for power performance measurements
  – Characterized SPEC-2K traces behavior
  – Extracted DVFS parameters and V/F scaling
• Cycle accurate simulation for FIFO impacts
  – 3 clocks each direction
• Coded analytic model to calculate performance
  – Function of power, frequency and workload

Workload construction

• Typical multi threaded benchmarks are insufficient
  – Server or HPC centric
    • Highly regular and uniform
  – But client and cloud computing is non uniform
• We performed Monte-Carlo simulation
  – Used SPEC-2K as an application pool
  – Randomly assigned a subset of 16 threads to the cores
  – Both fully and partially threaded studies
  – Performed all studies on the same workload
  – Repeated workload selection and analysis 200 times
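The workload selection described above can be sketched as follows. This is an illustration only: the pool lists just the SPEC int names that appear in the characterization table of this deck (the full SPEC-2K pool is larger), and drawing with replacement, the fixed seed and the function names are assumptions.

```python
import random

# Partial pool: the SPEC int applications named in this deck.
APP_POOL = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]


def draw_workload(rng, threads=16, cores=16):
    """Return one workload: a list of `cores` slots, None marks an idle core."""
    if not 0 < threads <= cores:
        raise ValueError("threads must be between 1 and cores")
    # Assumption: applications are drawn with replacement.
    apps = [rng.choice(APP_POOL) for _ in range(threads)]
    slots = apps + [None] * (cores - threads)
    rng.shuffle(slots)  # place the threads on random cores
    return slots


def draw_workloads(count=200, threads=16, cores=16, seed=0):
    """Draw `count` random workloads so every study sees the same set."""
    rng = random.Random(seed)
    return [draw_workload(rng, threads, cores) for _ in range(count)]


fully_threaded = draw_workloads()               # 16 threads on 16 cores
partially_threaded = draw_workloads(threads=4)  # 4 threads, 12 idle cores
```

Fixing the seed is one way to make sure all topologies are evaluated on the same 200 workloads.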

Results

Baseline: Single Voltage and Clock DVFS

• 10-25% performance gain from use of power headroom
• Serves as baseline for the studies to follow

[Figure: Baseline performance gain. X axis: workload (200 random workloads, sorted by performance). Y axis: performance relative to base frequency, 100% to 130%. DVFS to lowest constraint. Annotations: 100% = 16×Galgel, 140% = 16×Crafty]

Different topologies - Fully threaded workloads

• Example with power supply capability of 150%
• Some workloads gain performance, some lose compared to baseline
  – In contrast with previous studies
  – Assign budget asymmetrically

[Figure: Relative performance. X axis: workloads (200 random workloads, each series sorted independently). Y axis: performance relative to baseline [%], -6% to 6%. Oracle study, three topologies vs. baseline. Series: nVnC / 1V1C, 1VnC / 1V1C, nVnC / 1VnC. Annotations: 50% of apps lose performance, 50% of apps gain performance]

1V – Single voltage domain
nV – Multiple voltage domains
1C – Single clock domain
nC – Multiple clock domains

Partially threaded workload

• Fewer threads: higher benefit from shared power

[Figure: Performance vs. threads and policy, 250% headroom (Oracle study). X axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T). Y axis: performance, 110% to 160%. Series: 1V1C, nVnC, 1VnC. Annotations: "Multi VR better", "Single VR better"]

1V – Single voltage domain
nV – Multiple voltage domains
1C – Single clock domain
nC – Multiple clock domains

Gaining the best of both worlds: Clusters

• N clusters with 16/N cores each
• Sharing VR between cores in a cluster
• Setting optimal voltage frequency for each cluster

[Figure: four clusters, each supplied with I/4]
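A small sketch of the clustered layout: N clusters of 16/N cores, each cluster with its own regulator carrying 1/N of the total current. The function name and the 100A total (borrowed from the earlier simplified example) are illustrative assumptions.

```python
def make_clusters(cores=16, n_clusters=4, total_current=100.0):
    """Split `cores` into equal clusters and give each an equal current share.

    Returns (clusters, current_per_cluster), where `clusters` is a list of
    lists of core indices. Cores inside one cluster share that cluster's VR.
    """
    if n_clusters <= 0 or cores % n_clusters != 0:
        raise ValueError("cores must divide evenly into clusters")
    size = cores // n_clusters
    clusters = [list(range(i * size, (i + 1) * size)) for i in range(n_clusters)]
    return clusters, total_current / n_clusters


clusters, amps_per_cluster = make_clusters()
```

With the defaults this gives four clusters of four cores and 25A per cluster. A single busy core can then draw up to its cluster's share, which sits between the fully shared case (100A) and the fully split case (6.25A).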

Clusters

• Clustered topology almost equal to the best of both topologies
• Outperforms both when number of threads = number of clusters

[Figure: Performance vs. threads and policy, 250% headroom. X axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T). Y axis: performance, 110% to 160%. Series: 1V1C, nVnC, nVnC-8C-SM. Annotation: "Cluster always the best"]

1V – Single voltage domain
nV – Multiple voltage domains
1C – Single clock domain
nC – Multiple clock domains
xT – X threads

How to pick the best cluster size?

• Oracle study
• Compared to non-clustered (by workload)
• Calculated quadratic error from best topology
• Best scenarios highlighted
• "Diagonal behavior"
  – More constrained power delivery: larger clusters

            110%     130%     150%     200%     250%
1V1C         7.1%    11.4%    13.2%    14.8%    16.6%
1VnC         5.1%     9.0%    10.7%    12.4%    14.1%
nVnC-2C     28.6%    13.0%    14.1%    15.4%    17.5%
nVnC-4C     45.8%    14.7%    13.3%    12.2%    13.9%
nVnC-8C     55.6%    21.9%    16.5%     9.8%     7.6%

Columns – power delivery capability
Rows – number of clusters
Cells show distance from Oracle (smaller is better)
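One way to read the table is to pick, for each power delivery capability, the row with the smallest distance from Oracle. The sketch below does that over the numbers on this slide. Restricting the search to the clustered rows to expose the "diagonal" is an assumption about what was highlighted on the original slide.

```python
CAPABILITY = ["110%", "130%", "150%", "200%", "250%"]

# Distance from Oracle [%], copied from the table on this slide.
DISTANCE = {
    "1V1C":    [7.1, 11.4, 13.2, 14.8, 16.6],
    "1VnC":    [5.1, 9.0, 10.7, 12.4, 14.1],
    "nVnC-2C": [28.6, 13.0, 14.1, 15.4, 17.5],
    "nVnC-4C": [45.8, 14.7, 13.3, 12.2, 13.9],
    "nVnC-8C": [55.6, 21.9, 16.5, 9.8, 7.6],
}


def best_topology(rows):
    """For each capability column, return the row with the smallest distance."""
    best = {}
    for col, cap in enumerate(CAPABILITY):
        best[cap] = min(rows, key=lambda row: DISTANCE[row][col])
    return best


clustered_rows = [row for row in DISTANCE if row.startswith("nVnC-")]
best_clustered = best_topology(clustered_rows)
best_overall = best_topology(list(DISTANCE))
```

Among the clustered rows the winner moves from nVnC-2C (110%, 130%) through nVnC-4C (150%) to nVnC-8C (200%, 250%) as power delivery capability grows.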

Summary

• Power delivery is a major CPU perf. constraint
  – Overlooked by previous works
  – Multiple voltage domains do not allow power sharing
  – Lightly threaded workloads are most constrained
• Clustered topology mitigates sharing limitations
  – Allows sharing power within subsets of cores
  – Optimal cluster size: function of power delivery capability
• Explored the non uniform workloads
  – Different application types
  – Partially vs. fully threaded workloads

Thank You

Run time policies

• Policy to:
  – Evaluate run time parameters and select frequency
• Three control functions
  – Input: power or scalability
  – Compute: frequency for each core
    • Scale each domain to lowest constraint (e.g. power delivery, max freq)
• Calculated quadratic error from Oracle results

[Figure: three control functions, each plotting frequency against the input (power / scalability): Greedy (Winner Takes All), Linear (linear dependency), Polynomial (3 parameters)]
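A rough sketch of what a greedy winner-takes-all control function could look like. The slides do not give the policy's details, so everything here is an assumption: the function name, the `winner_fraction` parameter (one possible reading of the "WTA 50%" style labels in the results), and the two frequency levels.

```python
def wta_frequencies(scalability, winner_fraction=0.5, f_base=1.0, f_boost=1.5):
    """Greedy winner-takes-all: boost the most scalable cores, leave the rest.

    `scalability` holds one score per core (higher = gains more from
    frequency). The top `winner_fraction` of cores receive `f_boost`,
    all others stay at `f_base`. All parameters are hypothetical.
    """
    n = len(scalability)
    winners = max(1, round(n * winner_fraction))
    # Rank core indices from most to least scalable.
    ranked = sorted(range(n), key=lambda i: scalability[i], reverse=True)
    freqs = [f_base] * n
    for core in ranked[:winners]:
        freqs[core] = f_boost
    return freqs


# Scalability scores for gzip, mcf, crafty and gap from the SPEC int table.
freqs = wta_frequencies([0.95, 0.30, 0.99, 0.56], winner_fraction=0.5)
```

In a real policy the resulting frequencies would still have to be scaled down to the lowest constraint of each domain, as the bullets above state.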

Run time policy results

• Winning policy is a greedy (WTA) based on scalability
  – Very close to Oracle
• Random and power based policies are not good policies

1VnC
                       Max       Average
WTA 50%                5.84%     1.3%
WTA 33%                4.41%     0.6%
WTA 10%                1.23%     0.0%
WTA by Power 50%      22.76%     6.9%
Linear by SCA          9.60%     6.1%
Linear by power       49.76%    36.6%
Polynomial by SCA      5.23%     3.3%
Random                33.28%    19.9%

nVnC
                       Max       Average
WTA 50%                2.90%     0.8%
WTA 33%                3.37%     0.8%
WTA 10%                4.63%     1.7%
WTA by Power 50%       4.60%     2.3%
Linear by SCA          2.72%     1.5%
Linear by power        5.77%     3.8%
Polynomial by SCA      3.58%     1.5%
Random                 8.66%     4.3%

Distance from Oracle (smaller is better)
WTA – Winner Take All
SCA – Scalability

Workload characterization

• Measured score at two frequencies
• Measured total CPU power
  – Scaled power = (Workload Power)/(Max Power)
  – Results 33%-100%
    • Leakage + idle is ~30%
  – Most applications use less than 100% power
    • Even at Vmax, fmax they consume less than Imax
    • Reason: not all parts of the CPU are utilized
• Scalability = ΔPerf/ΔFrequency
  – Result 0%-100%
    • Low: memory bound
    • High: CPU bound

SPEC int    Scaled power (A)    Perf. scaling with freq. (B)    FIFO impact (C)
gzip        48%                 0.95                            0.13%
vpr         44%                 0.68                            2.92%
gcc         35%                 0.67                            0.92%
mcf         49%                 0.30                            2.92%
crafty      33%                 0.99                            0.59%
parser      60%                 0.78                            1.29%
eon         42%                 0.99                            0.00%
perlbmk     50%                 1.00                            0.31%
gap         45%                 0.56                            1.14%
vortex      60%                 0.73                            1.45%
bzip2       49%                 0.70                            0.71%
twolf       97%                 0.99                            4.68%
Int_rate    51%                 0.77                            1.42%
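The two metrics defined above can be computed from measurements at two frequencies. A minimal sketch, assuming that ΔPerf and ΔFrequency are relative changes (so that a fully CPU bound application scores 1.0). The function names and the sample numbers are illustrative and not measurements from the paper.

```python
def scaled_power(workload_power, max_power):
    """Scaled power = workload power / maximum power (0.0 to 1.0)."""
    if max_power <= 0:
        raise ValueError("max_power must be positive")
    return workload_power / max_power


def scalability(perf_lo, perf_hi, freq_lo, freq_hi):
    """Relative performance gain divided by relative frequency gain.

    Close to 0: memory bound. Close to 1: CPU bound.
    """
    if freq_hi <= freq_lo:
        raise ValueError("freq_hi must be greater than freq_lo")
    d_perf = (perf_hi - perf_lo) / perf_lo
    d_freq = (freq_hi - freq_lo) / freq_lo
    return d_perf / d_freq


# Hypothetical measurement: score rises 30% for a 50% frequency increase.
s = scalability(perf_lo=100.0, perf_hi=130.0, freq_lo=2.0, freq_hi=3.0)
p = scaled_power(workload_power=33.0, max_power=100.0)
```

In this made-up example the application has a scalability of about 0.6 and a scaled power of 0.33.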

Workload characterization

• Used cycle accurate simulation to evaluate FIFO impact per application (the FIFO impact column of the SPEC int table)
• All studies are averaged over the entire run, not accounting for variance over time
• Study applies also to phases in workload

Some DVFS model details

• All models are built with relative values and not absolute voltages, freq. or performance
• From min Vcc: linear scaling of frequency only

[Figure: Leakage vs. Voltage. X axis: Vcc [relative], 0.60 to 1.05. Y axis: relative leakage, 0.20 to 1.10. Series: Leakage, X^3 approximation]

[Figure: Power to frequency. X axis: power [%], 0% to 120%. Y axis: frequency [%], 0% to 120%. Power-law fit: y = 1.0102 x^{0.4414}, R^2 = 0.9986]

[Figure: Frequency as a function of V_gate. X axis: voltage [V], 0.60 to 1.20. Y axis: freq [GHz], 1.5 to 4. Series: Linear freq, Actual freq.]
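The two curve fits shown in the charts can be evaluated directly. A small sketch using the fitted power-law coefficients from the power-to-frequency chart and the cubic leakage approximation. The function names are illustrative, and both inputs are relative values, as the slide states for all models.

```python
def freq_from_power(power_rel):
    """Relative frequency for a relative power budget.

    Uses the power-law fit from the chart: y = 1.0102 * x**0.4414.
    """
    if power_rel < 0:
        raise ValueError("power_rel must be non-negative")
    return 1.0102 * power_rel ** 0.4414


def leakage_cubic(vcc_rel):
    """Relative leakage from relative Vcc using the X^3 approximation."""
    return vcc_rel ** 3


half_power_freq = freq_from_power(0.5)
leak_at_low_vcc = leakage_cubic(0.6)
```

Because the exponent is well below 1, halving the power budget costs far less than half the frequency under this fit. The cubic approximation gives about 0.22 relative leakage at 0.6 relative Vcc, which lines up with the lower end of the leakage chart.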

Workload characteristics – few observations

• Application power is distributed around ~60% of max power
  – Min 33%: leakage + idle power
  – Very few apps reach 100%
• Scalability is evenly distributed
• No correlation found between power and scalability
  – OOO characteristics
  – Simpler core is expected to show positive correlation
• Random pick of 16 cores:
  – Tighter overall power distribution
  – Very low probability for all applications to be high or low power

[Figure: Application power distribution. X axis: power, 0% to 120%. Axis labels: application count, probability. Series: Apps power distribution, Norm Dist]

[Figure: Performance Scaling Score vs. Power. X axis: power [% of max], 0% to 120%. Y axis: scaling [Perf/freq], 0.00 to 1.20]
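Why a random pick of 16 applications gives a tighter power distribution can be illustrated with the scaled power column of the SPEC int table. The sketch assumes independent draws with replacement, in which case the standard deviation of the 16-application average is the single-application standard deviation divided by sqrt(16) = 4. This is a textbook property and not a result taken from the paper.

```python
import statistics

# Scaled power of the 12 SPEC int applications (from the table in this deck).
SCALED_POWER = [0.48, 0.44, 0.35, 0.49, 0.33, 0.60,
                0.42, 0.50, 0.45, 0.60, 0.49, 0.97]

mean_power = statistics.mean(SCALED_POWER)

# Spread of one application's power across the pool.
sigma_single = statistics.pstdev(SCALED_POWER)

# Spread of the average power of 16 independently drawn applications.
sigma_mean_of_16 = sigma_single / 16 ** 0.5
```

The pool mean matches the 51% Int_rate row of the table, and the spread of a 16-application average is one quarter of the single-application spread.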

Why is VR a constraint - physics

• Need close proximity

[Figure: board-level power delivery. Labels: Battery, GFX, CPU, Controller, Drivers, Inductors, Bulk Cap.]

Overview

• How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
• Contributions:
  – Power delivery constrains DVFS
    • Multi-voltage domains not so easy
  – Methodology to evaluate CMP workloads
  – Clustered voltage and clock domains

Work Overview - scope

• 16 core power constrained CMP
• 1 thru 16 voltage regulators (VR) and clock domains
  – Either on chip or off chip VR
• Independent clock domains require a FIFO buffer: increased latency
• Best topology? Optimal policy? Under constraints

[Figure: CPU block diagram. Labels: PE #1 to PE #n, PMU, Cache, FIFO Buffer, Core #1 to Core #n, L2 Cache, I/O and Memory, Interconnect, DC/DC VR]