Dec-2009 Chip with Multiple Clock and Voltage Domains 1
Multiple Clock and Voltage Domains for Chip Multi-Processors
December 2009
Efraim Rotem, Intel Corporation, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Microsoft R&D, Israel
Uri Weiser, Technion, Israel
Compute Performance matters
[Figure: compute performance by year, 1978 to 2006, log scale from 1 to 10,000. Source: Dave Patterson]
• Fueled by a combination of process and architecture
• We would like to keep on providing performance
– Power is the #1 limiter
– Both process technology and ILP slow down → multi-core architectures
[Figure annotation, power scale 1W / 10W / 100W: an order of magnitude more power efficient, but deep in the power wall]
Work Overview - scope
• How to best architect and manage clock and voltage domains of a CMP to maximize performance under power constraints
• 16-core power-constrained CMP
• 1 through 16 voltage regulators (VR)
– Either on-chip or off-chip VR
• 1 through 16 clock domains
– FIFO buffers increase latency
• Paper contributions:
– Power delivery constrains DVFS
  • Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains
[Figure: two CPU block diagrams. Labels: Core #1 to Core #n (each with PE #1 to PE #n, PMU, Cache, FIFO Buffer), Interconnect, PMU, L2 Cache, I/O and Memory, DC/DC VRs. The second diagram shows additional DC/DC VRs.]
Operation point and constraints
• Process technology voltages
– Voltage range Vmin – Vmax
– Frequency range fmin – 2fmin
– Nominal working point Vmin, fmin
• Lower bound on quality of service
– Frequency DFS down to ½ fmin
• Total power is a constraint
– Do not exceed nominal power
• Power delivery has been added as a constraint
• Most constraining parameter wins
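The rule "most constraining parameter wins" can be illustrated with a minimal sketch. The function name, the argument names, and the clamping to the ½ fmin to 2 fmin range are assumptions made for illustration; the slides do not give an implementation. All frequencies are relative to fmin.

```python
def allowed_frequency(f_voltage_limit, f_power_limit, f_delivery_limit, f_min=1.0):
    """Pick the operating frequency allowed by the tightest constraint.

    Each argument is the highest frequency that one constraint would permit
    (hypothetical inputs, relative to f_min).
    """
    ceiling = 2.0 * f_min   # top of the frequency range (fmin to 2 fmin)
    floor = 0.5 * f_min     # quality-of-service bound (DFS down to 1/2 fmin)
    # The most constraining parameter wins: take the minimum of all limits.
    f = min(f_voltage_limit, f_power_limit, f_delivery_limit, ceiling)
    # Assumption: never go below the QoS floor, even if a limit asks for less.
    return max(f, floor)


if __name__ == "__main__":
    # Here the total power limit (1.4) is the tightest constraint.
    print(allowed_frequency(1.8, 1.4, 1.6))
```

In this sketch a limit below the floor is simply clamped; a real power manager would need a different response in that case (for example idling the core), which is outside what the slides describe.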
Why is VR a constraint? Simplified example
• Given a 16-core, 100A shared power delivery
– Tying all cores together allows sharing current among cores
– Allows one core to consume all the current
• Assume we can split the same VR into 16
– Allows each core a fixed 100A/16
– Sharing is not possible
– Keeping capability requires 1,600A!
[Figure: 16 cores, each labeled I/16, and a single core labeled I]
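The arithmetic behind the simplified example can be written out as a short sketch. The function names are hypothetical; the numbers (16 cores, 100A) come from the slide.

```python
def per_core_peak_current(total_current_a, n_cores, shared):
    """Peak current a single core can draw.

    shared=True:  one VR feeds all cores, so one core may take everything.
    shared=False: the VR is split evenly, so each core gets a fixed share.
    """
    return total_current_a if shared else total_current_a / n_cores


def capacity_to_keep_shared_peak(total_current_a, n_cores):
    """Total capacity needed if every split VR must match the shared peak."""
    return total_current_a * n_cores


if __name__ == "__main__":
    print(per_core_peak_current(100, 16, shared=True))    # 100
    print(per_core_peak_current(100, 16, shared=False))   # 6.25
    print(capacity_to_keep_shared_peak(100, 16))          # 1600
```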
Power delivery is constrained
• Need power delivery headroom for performance
• Replacing 1 VR by 16 individual VRs:
– Does not allow current sharing between cores
– Results in degraded power delivery
• New technologies:
– Need less area / volume, BUT
– Still deliver limited current
• More details in the paper
Modeling methodology Workload construction
Hybrid model
• Offline characterization of a real CPU:
– Instrumented Intel® Core™-2 Duo for power and performance measurements
– Characterized SPEC-2K traces behavior
– Extracted DVFS parameters and V/F scaling
• Cycle-accurate simulation for FIFO impacts
– 3 clocks each direction
• Coded analytic model to calculate performance
– Function of power, frequency and workload
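The slides do not give the equations of the analytic model. As a first-order sketch only, assume performance follows frequency in proportion to the workload's measured scalability, and that the FIFO impact acts as a multiplicative penalty when a core sits in its own clock domain. The function name and both assumptions are illustrative, not the authors' model.

```python
def relative_performance(freq_ratio, scalability, fifo_impact=0.0):
    """First-order performance estimate relative to the nominal frequency.

    freq_ratio:   f / f_nominal
    scalability:  delta-perf / delta-frequency for the workload (0.0 to 1.0)
    fifo_impact:  fractional slowdown from clock-domain FIFOs (0.0 if none)
    """
    perf = 1.0 + scalability * (freq_ratio - 1.0)   # assumed linear scaling
    return perf * (1.0 - fifo_impact)               # assumed FIFO penalty


if __name__ == "__main__":
    # Scalability values taken from the SPEC int characterization table.
    print(relative_performance(1.2, 0.99))           # crafty: CPU bound
    print(relative_performance(1.2, 0.30))           # mcf: memory bound
    print(relative_performance(1.0, 0.30, 0.0292))   # mcf with FIFO impact
```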
Workload construction
• Typical multi-threaded benchmarks are insufficient
– Server or HPC centric
  • Highly regular and uniform
– But client and cloud computing is non-uniform
• We performed Monte-Carlo simulation
– Used SPEC-2K as an application pool
– Randomly assigned a subset of 16 threads to the cores
– Both fully and partially threaded studies
– Performed all studies on the same workload
– Repeated workload selection and analysis 200 times
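The workload selection step can be sketched as follows. The pool below lists only the SPEC int names that appear in the characterization table; sampling with replacement, the seed, and the function name are assumptions made for illustration.

```python
import random

# Application pool: SPEC int names from the characterization table.
SPEC_INT = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]


def build_workloads(n_workloads=200, n_cores=16, n_threads=16, seed=0):
    """Return n_workloads random assignments of applications to cores.

    Each workload is a list of n_cores slots; a slot holds an application
    name, or None for an idle core in a partially threaded study.
    """
    rng = random.Random(seed)   # fixed seed: every study sees the same workloads
    workloads = []
    for _ in range(n_workloads):
        apps = [rng.choice(SPEC_INT) for _ in range(n_threads)]
        slots = apps + [None] * (n_cores - n_threads)
        rng.shuffle(slots)      # place the threads on random cores
        workloads.append(slots)
    return workloads


if __name__ == "__main__":
    fully = build_workloads()
    partial = build_workloads(n_threads=4)
    print(len(fully), fully[0])
    print(partial[0])
```

Reusing one seed for all topologies mirrors the slide's point that all studies were performed on the same workloads.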
Results
Baseline: Single Voltage and Clock DVFS
• 10-25% performance gain from use of power headroom
• Serves as baseline for the studies to follow
• 200 random workloads
• DVFS to lowest constraint
• Sorted by performance
• Relative performance shown
[Figure: Baseline performance gain. X axis: workload (1 to 200). Y axis: performance relative to base frequency (100% to 130%). Series: Baseline. Annotations: 100% = 16 x Galgel, 140% = 16 x Crafty.]
Different topologies - Fully threaded workloads
• Example with power supply capability of 150%
• Some workloads gain performance, some lose compared to baseline
– In contrast with previous studies
– Assign budget asymmetrically
• 200 random workloads
• Oracle study
• Three topologies vs. baseline
• Each sorted independently
• Performance relative to baseline
[Figure: Relative Performance. X axis: workloads (sorted), 20 to 180. Y axis: relative performance, -6% to 6%. Series: nVnC / 1V1C, 1VnC / 1V1C, nVnC / 1VnC. Annotations: 50% of apps lose performance, 50% of apps gain performance.]
Legend: 1V = single voltage domain, nV = multiple voltage domains, 1C = single clock domain, nC = multiple clock domains
Partially threaded workload
• Fewer threads: higher benefit from shared power
[Figure: Performance vs. threads and policy, 250% headroom. X axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T). Y axis: performance, 110% to 160%. Series: 1V1C, nVnC, 1VnC. Annotations: multi VR better, single VR better. Oracle study.]
Legend: 1V = single voltage domain, nV = multiple voltage domains, 1C = single clock domain, nC = multiple clock domains
Gaining the best of both worlds: Clusters
• N clusters with 16/N cores each
• Sharing VR between cores in a cluster
• Setting optimal voltage and frequency for each cluster
[Figure: four clusters, each supplied with I/4]
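A minimal sketch of the clustered layout: the cores are split into N equal clusters and each cluster's VR receives an equal share of the total current, which the cores inside that cluster can share. The function name and the equal split are assumptions for illustration.

```python
def cluster_layout(n_cores=16, n_clusters=4, total_current=1.0):
    """Split cores into equal clusters and give each cluster an equal budget.

    Returns (clusters, budget): clusters is a list of core-index lists and
    budget is the current available to each cluster's shared VR.
    """
    if n_cores % n_clusters != 0:
        raise ValueError("n_cores must be a multiple of n_clusters")
    size = n_cores // n_clusters
    clusters = [list(range(i * size, (i + 1) * size)) for i in range(n_clusters)]
    budget = total_current / n_clusters
    return clusters, budget


if __name__ == "__main__":
    clusters, budget = cluster_layout()
    print(clusters)   # four clusters of four cores
    print(budget)     # 0.25, i.e. I/4 per cluster
```

With `n_clusters=1` this degenerates to the single shared VR, and with `n_clusters=16` to one VR per core.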
Clusters
• Clustered topology almost equal to the best of both topologies
• Outperforms both when number of threads = number of clusters
[Figure: Performance vs. threads and policy, 250% headroom. X axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T). Y axis: performance, 110% to 160%. Series: 1V1C, nVnC, nVnC-8C-SM. Annotation: cluster always the best.]
Legend: 1V = single voltage domain, nV = multiple voltage domains, 1C = single clock domain, nC = multiple clock domains, xT = X threads
How to pick the best cluster size?
• Oracle study
• Compared to non-clustered (by workload)
• Calculated quadratic error from best topology
• Best scenarios highlighted
• “Diagonal behavior”
– More constrained power delivery → larger clusters
Topology   110%    130%    150%    200%    250%
1V1C       7.1%    11.4%   13.2%   14.8%   16.6%
1VnC       5.1%    9.0%    10.7%   12.4%   14.1%
nVnC-2C    28.6%   13.0%   14.1%   15.4%   17.5%
nVnC-4C    45.8%   14.7%   13.3%   12.2%   13.9%
nVnC-8C    55.6%   21.9%   16.5%   9.8%    7.6%
Columns: power delivery capability
Rows: number of clusters
Cells show distance from Oracle (smaller is better)
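The trend in the table can be checked by picking, for each power delivery capability, the row with the smallest distance from Oracle. The values below are transcribed from the slide; the helper name is hypothetical, and which cells were highlighted on the original slide is not recoverable from the text.

```python
CAPABILITY = ["110%", "130%", "150%", "200%", "250%"]

# Distance from Oracle in percent, one list entry per capability column.
DISTANCE = {
    "1V1C":    [7.1, 11.4, 13.2, 14.8, 16.6],
    "1VnC":    [5.1, 9.0, 10.7, 12.4, 14.1],
    "nVnC-2C": [28.6, 13.0, 14.1, 15.4, 17.5],
    "nVnC-4C": [45.8, 14.7, 13.3, 12.2, 13.9],
    "nVnC-8C": [55.6, 21.9, 16.5, 9.8, 7.6],
}

CLUSTERED = ["nVnC-2C", "nVnC-4C", "nVnC-8C"]


def best_topology(rows):
    """For each capability column, return the row with the smallest distance."""
    best = {}
    for col, cap in enumerate(CAPABILITY):
        best[cap] = min(rows, key=lambda name: DISTANCE[name][col])
    return best


if __name__ == "__main__":
    print(best_topology(CLUSTERED))        # clustered topologies only
    print(best_topology(list(DISTANCE)))   # all topologies
```

Restricted to the clustered rows, the best row moves from nVnC-2C at low capability to nVnC-8C at high capability, which is consistent with the diagonal behavior the slide describes.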
Summary
• Power delivery is a major CPU performance constraint
– Overlooked by previous works
– Multiple voltage domains do not allow power sharing
– Lightly threaded workloads are most constrained
• Clustered topology mitigates sharing limitations
– Allows sharing power within subsets of cores
– Optimal cluster size: function of power delivery capability
• Explored non-uniform workloads
– Different application types
– Partially vs. fully threaded workloads
Thank You
Run time policies
• Policy to:
– Evaluate run time parameters and select frequency
• Three control functions
– Input: power or scalability
– Compute: frequency for each core
  • Scale each domain to lowest constraint (e.g. power delivery, max freq)
• Calculated quadratic error from Oracle results
[Figure: three plots of frequency vs. input (power / scalability): Greedy (Winner Takes All), Linear (linear dependency), Polynomial (3 parameters)]
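The three control functions can be sketched as mappings from a per-core input (power or scalability) to a requested frequency. The slides only name the shapes, so the exact forms below are assumptions: the winner fraction in the greedy policy, the min-max interpolation in the linear policy, and the quadratic with three parameters a, b, c are all illustrative choices.

```python
def greedy_wta(inputs, winner_fraction, f_low, f_high):
    """Winner takes all: the top fraction of cores gets f_high, the rest f_low."""
    n_winners = max(1, round(len(inputs) * winner_fraction))
    ranked = sorted(range(len(inputs)), key=lambda i: inputs[i], reverse=True)
    winners = set(ranked[:n_winners])
    return [f_high if i in winners else f_low for i in range(len(inputs))]


def linear(inputs, f_low, f_high):
    """Frequency grows linearly from f_low (smallest input) to f_high (largest)."""
    lo, hi = min(inputs), max(inputs)
    span = (hi - lo) or 1.0   # avoid division by zero when all inputs are equal
    return [f_low + (f_high - f_low) * (x - lo) / span for x in inputs]


def polynomial(inputs, a, b, c):
    """Three-parameter polynomial: f = a + b*x + c*x^2 (assumed form)."""
    return [a + b * x + c * x * x for x in inputs]


if __name__ == "__main__":
    scalability = [0.95, 0.30, 0.99, 0.68]   # gzip, mcf, crafty, vpr
    print(greedy_wta(scalability, 0.5, 0.5, 2.0))
    print(linear(scalability, 0.5, 2.0))
    print(polynomial(scalability, 0.5, 1.0, 0.5))
```

In every case the requested frequency would still have to be scaled down to the lowest constraint of its domain, as the slide notes.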
Run time policy results
• Winning policy is a greedy (WTA) based on scalability
– Very close to Oracle
• Random and power based policies are not good policies
1VnC
Policy              Max      Average
WTA 50%             5.84%    1.3%
WTA 33%             4.41%    0.6%
WTA 10%             1.23%    0.0%
WTA by Power 50%    22.76%   6.9%
Linear by SCA       9.60%    6.1%
Linear by power     49.76%   36.6%
Polynomial by SCA   5.23%    3.3%
Random              33.28%   19.9%
nVnC
Policy              Max      Average
WTA 50%             2.90%    0.8%
WTA 33%             3.37%    0.8%
WTA 10%             4.63%    1.7%
WTA by Power 50%    4.60%    2.3%
Linear by SCA       2.72%    1.5%
Linear by power     5.77%    3.8%
Polynomial by SCA   3.58%    1.5%
Random              8.66%    4.3%
Distance from Oracle (smaller is better)
WTA = Winner Take All, SCA = Scalability
Workload characterization
• Measured score at two frequencies
• Measured total CPU power
– Scaled power = (Workload Power)/(Max Power)
– Results 33%-100%
  • Leakage + idle is ~30%
– Most applications use less than 100% power
  • Even at Vmax, fmax they consume less than Imax
  • Reason: not all parts of the CPU are utilized
• Scalability = ΔPerf/ΔFrequency
– Result 0%-100%
  • Low: memory bound
  • High: CPU bound
SPEC int   Scaled Power   Perf. scaling with freq.   FIFO impact
gzip       48%            0.95                       0.13%
vpr        44%            0.68                       2.92%
gcc        35%            0.67                       0.92%
mcf        49%            0.30                       2.92%
crafty     33%            0.99                       0.59%
parser     60%            0.78                       1.29%
eon        42%            0.99                       0.00%
perlbmk    50%            1.00                       0.31%
gap        45%            0.56                       1.14%
vortex     60%            0.73                       1.45%
bzip2      49%            0.70                       0.71%
twolf      97%            0.99                       4.68%
Int_rate   51%            0.77                       1.42%
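The two characterization metrics can be computed from raw measurements as in the sketch below. It assumes that ΔPerf and ΔFrequency are relative changes between the two measured frequencies; the function names and the sample numbers are hypothetical.

```python
def scaled_power(workload_power_w, max_power_w):
    """Scaled power = workload power / max power."""
    return workload_power_w / max_power_w


def scalability(perf_low, perf_high, freq_low, freq_high):
    """Scalability = relative performance change / relative frequency change.

    Near 0.0: memory bound (performance barely follows frequency).
    Near 1.0: CPU bound (performance follows frequency).
    """
    d_perf = perf_high / perf_low - 1.0
    d_freq = freq_high / freq_low - 1.0
    return d_perf / d_freq


if __name__ == "__main__":
    # Hypothetical measurement: score 100 at 2.0 GHz and 115 at 2.4 GHz.
    print(scalability(100.0, 115.0, 2.0, 2.4))
    print(scaled_power(17.5, 35.0))
```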
Workload characterization
• Used cycle-accurate simulation to evaluate FIFO impact per application (FIFO impact column of the SPEC int table)
• All studies are averaged over the entire run, not accounting for variance over time
• Study applies also to phases in workload
Some DVFS model details
All models are built with relative values and not absolute voltages, frequencies or performance
From min Vcc – linear scaling of frequency only
[Figure: Leakage vs. Voltage. X axis: Vcc [relative], 0.60 to 1.05. Y axis: relative leakage, 0.20 to 1.10. Series: Leakage, X^3 approximation.]
[Figure: Power to frequency. X axis: Power [%], 0% to 120%. Y axis: Frequency [%], 0% to 120%. Power-law fit: y = 1.0102 x^0.4414, R^2 = 0.9986.]
[Figure: Frequency as a function of V_gate. Axes: Voltage [V], 0.60 to 1.20, and Freq [GHz], 1.5 to 4. Series: Linear freq, Actual freq.]
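The two fitted curves from the charts can be written as small functions. The cubic leakage approximation and the power-law coefficients (1.0102 and 0.4414) are read from the chart labels; the normalization of leakage to 1.0 at relative Vcc = 1.0 and the function names are assumptions.

```python
def relative_leakage(vcc_rel):
    """X^3 approximation: leakage relative to its value at Vcc = 1.0."""
    return vcc_rel ** 3


def relative_frequency(power_rel, a=1.0102, b=0.4414):
    """Power-law fit from the chart: frequency = a * power^b (relative units)."""
    return a * power_rel ** b


if __name__ == "__main__":
    for v in (0.6, 0.8, 1.0):
        print(v, relative_leakage(v))
    for p in (0.5, 1.0):
        print(p, relative_frequency(p))
```

The exponent below 0.5 means that, under this fit, halving power costs considerably less than half the frequency.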
Workload characteristics – few observations
• Application power is distributed around ~60% of max power
– Min 33%: leakage + idle power
– Very few apps reach 100%
• Scalability is evenly distributed
• No correlation found between power and scalability
– OOO characteristics
– Simpler core is expected to show positive correlation
• Random pick of 16 cores:
– Tighter overall power distribution
– Very low probability for all applications to be high or low power
[Figure: Application power distribution. Horizontal scale: 0% to 120%. Axis labels: application count, probability. Series: apps power distribution, normal distribution.]
[Figure: Performance scaling score vs. power. X axis: Power [% of max], 0% to 120%. Y axis: Scaling [Perf/freq], 0.00 to 1.20.]
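The observation that a random pick of 16 applications gives a tighter power distribution can be illustrated with a small simulation. It uses the scaled power values from the SPEC int table; sampling with replacement, the trial count, and the seed are assumptions.

```python
import random
import statistics

# Scaled power per SPEC int application, from the characterization table.
SCALED_POWER = [0.48, 0.44, 0.35, 0.49, 0.33, 0.60,
                0.42, 0.50, 0.45, 0.60, 0.49, 0.97]


def power_spread(n_cores=16, n_trials=2000, seed=0):
    """Compare the spread of single applications with that of 16-core averages.

    Returns (per_app_stdev, per_chip_stdev).
    """
    rng = random.Random(seed)
    chip_means = [statistics.fmean(rng.choices(SCALED_POWER, k=n_cores))
                  for _ in range(n_trials)]
    return statistics.pstdev(SCALED_POWER), statistics.pstdev(chip_means)


if __name__ == "__main__":
    per_app, per_chip = power_spread()
    print(per_app, per_chip)
```

For independent picks the spread of the average shrinks roughly with the square root of the number of cores, so a chip where all 16 applications are high power or all are low power is unlikely.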
Why is VR a constraint - physics
[Figure: board layout with labels Battery, GFX, Controller, Drivers, Inductors, Bulk Cap., CPU. Annotation: need close proximity.]
Overview
• How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
• Contributions:
– Power delivery constrains DVFS
  • Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains
Work Overview - scope
• 16-core power-constrained CMP
• 1 through 16 voltage regulators (VR) and clock domains
– Either on-chip or off-chip VR
• Independent clock domains require a FIFO buffer → increased latency
• Best topology? Optimal policy? Under constraints
[Figure: two CPU block diagrams. Labels: Core #1 to Core #n (each with PE #1 to PE #n, PMU, Cache, FIFO Buffer), Interconnect, PMU, L2 Cache, I/O and Memory, DC/DC VRs. The second diagram shows additional DC/DC VRs.]