Dec-2009 Chip with Multiple Clock and Voltage Domains 1

Multiple Clock and Voltage Domains for Chip Multi Processors

December - 2009

Efraim Rotem, Intel Corporation, Israel

Ran Ginosar, Technion, Israel

Avi Mendelson, Microsoft R&D, Israel

Uri Weiser, Technion, Israel

Compute Performance matters

[Figure: performance over time, 1978 to 2006, on a scale from 1 to 10,000. Source: Dave Patterson]

• Fueled by a combination of process technology and architecture
• We would like to keep on providing performance
– Power is the #1 limiter
– Both process technology and ILP are slowing down: multi core architectures
• 100W: an order of magnitude more power efficient, but deep in the power wall

Work Overview - scope

• How to best architect and manage clock and voltage domains of a CMP to maximize performance under power constraints
• 16 core power constrained CMP
• 1 through 16 voltage regulators (VR)
– Either on chip or off chip VR
• 1 through 16 clock domains
– FIFO buffers increase latency
• Paper contributions:
– Power delivery constrains DVFS
  • Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains

[Figure: two CMP block diagrams, each showing Core #1, Core #2 through Core #n, a FIFO buffer, the L2 cache, and I/O and memory]

Operation point and constraints

• Process technology voltages
– Voltage range Vmin – Vmax
– Frequency range fmin – 2fmin
– Nominal working point Vmin, fmin
• Lower bound on quality of service
– Frequency DFS down to ½ fmin
• Total power is a constraint
– Must not exceed nominal power
• Power delivery has been added as a constraint
• Most constraining parameter wins

Why is VR a constraint? Simplified example

• Given a 16 core CMP with a 100A shared power delivery
– Tying all cores together allows sharing current among cores
– Allows one core to consume all the current
• Assume we can split the same VR into 16
– Allows each core a fixed 100A/16
– Sharing is not possible
– Keeping the same capability requires 1,600A!

[Figure: current split as I/16 per core]
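The arithmetic behind this simplified example can be checked with a few lines of Python. This is only a sketch of the slide's numbers (100A shared, 16 cores); the function name `split_vr_budget` is made up for illustration.

```python
def split_vr_budget(shared_amps=100.0, cores=16):
    """Compare one shared VR against the same VR split per core.

    Returns (fixed current per core after the split,
             total current needed if every per-core VR must still be
             able to deliver what a single core could draw when shared).
    """
    per_core_fixed = shared_amps / cores       # 100A / 16
    total_for_same_peak = shared_amps * cores  # 16 regulators of 100A each
    return per_core_fixed, total_for_same_peak


per_core, total = split_vr_budget()
print(f"Fixed budget per core: {per_core} A")
print(f"Total needed to keep single-core capability: {total} A")
```

With a shared regulator a single busy core may draw up to the full 100A; after the split it is capped at 6.25A unless each of the 16 regulators is sized for 100A.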

Power delivery is constrained

• Need power delivery headroom for performance

• Replacing 1 VR by 16 individual VRs:
– Does not allow current sharing between cores
– Results in degraded power delivery
• New technologies:
– Need less area / volume, BUT
– Still deliver limited current

• More details in the paper

Modeling methodology and workload construction

Hybrid model

• Offline characterization of a real CPU:
– Instrumented Intel® Core™-2 Duo for power and performance measurements
– Characterized the behavior of SPEC-2K traces
– Extracted DVFS parameters and V/F scaling
• Cycle accurate simulation for FIFO impacts
– 3 clocks in each direction
• Coded analytic model to calculate performance
– Function of power, frequency and workload

Workload construction

• Typical multi threaded benchmarks are insufficient
– Server or HPC centric
  • Highly regular and uniform
– But client and cloud computing is non uniform
• We performed Monte-Carlo simulation
– Used SPEC-2K as an application pool
– Randomly assigned a subset of 16 threads to the cores
– Both fully and partially threaded studies
– Performed all studies on the same workload
– Repeated workload selection and analysis 200 times
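The Monte-Carlo workload construction can be sketched roughly as follows. This is an illustration, not the authors' tool: the pool holds only the SPEC int applications listed in the deck's characterization table, sampling with replacement is an assumption, and `build_workloads` is a hypothetical name.

```python
import random

# SPEC-2K integer benchmarks named in the deck's characterization table.
APP_POOL = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]


def build_workloads(n_threads=16, n_cores=16, n_workloads=200, seed=0):
    """Return n_workloads random core assignments.

    Each assignment is a list of n_cores entries: an application name
    for a busy core, or None for an idle core (partially threaded case).
    Sampling with replacement is an assumption of this sketch.
    """
    if not 0 < n_threads <= n_cores:
        raise ValueError("n_threads must be between 1 and n_cores")
    rng = random.Random(seed)
    workloads = []
    for _ in range(n_workloads):
        apps = rng.choices(APP_POOL, k=n_threads)
        assignment = apps + [None] * (n_cores - n_threads)
        rng.shuffle(assignment)  # spread busy and idle cores randomly
        workloads.append(assignment)
    return workloads


fully_threaded = build_workloads(n_threads=16)
partially_threaded = build_workloads(n_threads=4)
print(len(fully_threaded), partially_threaded[0])
```

Fixing the seed keeps the same 200 workloads across topology studies, which mirrors the slide's point that all studies were performed on the same workload.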

Results

Baseline: Single Voltage and Clock DVFS

• 10-25% performance gain from use of power headroom

• Serves as baseline for the studies to follow

• 200 random workloads

• DVFS to lowest constraint

• Sorted by performance

• Shown relative performance

[Figure: Baseline performance gain; x-axis: workload (200 workloads, sorted); series: Baseline. Annotations: 100% = 16 x Galgel; 140% = 16 x Crafty]

Different topologies - Fully threaded workloads

• Example with power supply capability of 150%

• Some workloads gain performance, some lose compared to baseline

– In contrast with previous studies
– Assign budget asymmetrically
• 200 random workloads

• Oracle study

• Three topologies vs. baseline

• Each Sorted independently

• Performance relative to baseline

[Figure: Relative performance vs. workloads (sorted); series: nVnC / 1V1C, 1VnC / 1V1C, nVnC / 1VnC. Annotations: 50% of apps lose performance; 50% of apps gain performance]

Legend:
1V – Single voltage domain
nV – Multiple voltage domains
1C – Single clock domain
nC – Multiple clock domains

Partially threaded workload

• Fewer threads: higher benefit from shared power
• Oracle study

[Figure: Performance vs. threads and policy at 250% headroom; x-axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T); series: 1V1C, nVnC, 1VnC. Annotations: Multi VR better; Single VR better]

Gaining the best of both worlds: Clusters

• N clusters with 16/N cores each
• Sharing VR between cores in a cluster
• Setting optimal voltage and frequency for each cluster

Clusters

• Clustered topology is almost equal to the best of both topologies
• Outperforms both when number of threads = number of clusters
• Cluster always the best

[Figure: Performance vs. threads and policy at 250% headroom; x-axis: number of threads (2T, 4T, 8T, 12T, 14T, 16T); series include nVnC-8C-SM]

Legend:
1V – Single voltage domain
nV – Multiple voltage domains
1C – Single clock domain
nC – Multiple clock domains
xT – X threads

How to pick the best cluster size?

• Oracle study
• Compared to non-clustered (by workload)
• Calculated quadratic error from best topology
• Best scenarios highlighted
• “Diagonal behavior”
– More constrained power delivery: larger clusters

Distance from Oracle (smaller is better)
Columns: power delivery capability. Rows: number of clusters.

            110%     130%     150%     200%     250%
1V1C        7.1%    11.4%    13.2%    14.8%    16.6%
1VnC        5.1%     9.0%    10.7%    12.4%    14.1%
nVnC-2C    28.6%    13.0%    14.1%    15.4%    17.5%
nVnC-4C    45.8%    14.7%    13.3%    12.2%    13.9%
nVnC-8C    55.6%    21.9%    16.5%     9.8%     7.6%
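As a reading aid for the table, the sketch below picks, for each power delivery capability, the topology with the smallest distance from Oracle. The numbers are copied from the table; the selection rule (plain minimum per column) is an assumption of this sketch and may not match the cells highlighted on the original slide.

```python
CAPABILITY = ["110%", "130%", "150%", "200%", "250%"]

# Distance from Oracle in percent (smaller is better), from the table.
DISTANCE = {
    "1V1C":    [7.1, 11.4, 13.2, 14.8, 16.6],
    "1VnC":    [5.1, 9.0, 10.7, 12.4, 14.1],
    "nVnC-2C": [28.6, 13.0, 14.1, 15.4, 17.5],
    "nVnC-4C": [45.8, 14.7, 13.3, 12.2, 13.9],
    "nVnC-8C": [55.6, 21.9, 16.5, 9.8, 7.6],
}


def closest_to_oracle():
    """Map each power delivery capability to the topology with the
    smallest distance from Oracle in that column."""
    best = {}
    for col, cap in enumerate(CAPABILITY):
        best[cap] = min(DISTANCE, key=lambda name: DISTANCE[name][col])
    return best


for cap, topology in closest_to_oracle().items():
    print(cap, topology)
```

On these numbers the single-voltage topology is closest to Oracle at tightly constrained power delivery, while the 8-cluster topology is closest at 200% and 250%.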

Summary

• Power delivery is a major CPU performance constraint
– Overlooked by previous works
– Multiple voltage domains do not allow power sharing
– Lightly threaded workloads are most constrained
• Clustered topology mitigates sharing limitations
– Allows sharing power within subsets of cores
– Optimal cluster size: function of power delivery capability
• Explored non uniform workloads
– Different application types
– Partially vs. fully threaded workloads

Thank You

Run time policies

• Policy to:
– Evaluate run time parameters and select frequency
• Three control functions
– Input: power or scalability
– Compute: frequency for each core
• Scale each domain to lowest constraint (e.g. power delivery, max
• Calculated quadratic error from Oracle results

[Figure: three control functions mapping the input (power / scalability) to frequency F: Greedy (Winner Takes All), Linear (linear dependency), Polynomial (3 parameters)]
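The three control function shapes can be sketched as simple mappings from a normalized input (power or scalability, 0 to 1) to a relative frequency. Everything numeric here is an assumption made for illustration: the threshold, the polynomial coefficients, and the frequency bounds (0.5 and 2.0, loosely taken from the ½ fmin to 2fmin range on the constraints slide). The threshold is not tied to the percentages that label the WTA policies in the results, whose meaning the slides do not spell out.

```python
def clamp(value, lo, hi):
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def greedy_wta(x, threshold=0.5, f_lo=0.5, f_hi=2.0):
    """Winner takes all: full frequency above the threshold, minimum below."""
    return f_hi if x >= threshold else f_lo


def linear(x, f_lo=0.5, f_hi=2.0):
    """Frequency grows linearly with the input."""
    x = clamp(x, 0.0, 1.0)
    return f_lo + (f_hi - f_lo) * x


def polynomial(x, a=0.5, b=0.0, c=1.5, f_lo=0.5, f_hi=2.0):
    """Three-parameter quadratic a + b*x + c*x^2, clamped to the range."""
    x = clamp(x, 0.0, 1.0)
    return clamp(a + b * x + c * x * x, f_lo, f_hi)


for scalability in (0.30, 0.70, 0.99):
    print(scalability, greedy_wta(scalability),
          linear(scalability), polynomial(scalability))
```

A full policy would then scale each domain down to its lowest constraint, as the slide notes; that step is left out of this sketch.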

Run time policy results

• Winning policy is a greedy (WTA) based on scalability
– Very close to Oracle
• Random and power based policies are not good policies

Distance from Oracle (smaller is better)
WTA – Winner Take All
SCA – Scalability

Policy                  Max       Average
WTA 50%                 5.84%     1.3%
WTA 33%                 4.41%     0.6%
WTA 10%                 1.23%     0.0%
WTA by Power 50%       22.76%     6.9%
Linear by SCA           9.60%     6.1%
Linear by power        49.76%    36.6%
Polynomial by SCA       5.23%     3.3%
Random                 33.28%    19.9%

1VnC
Policy                  Max       Average
WTA 50%                 2.90%     0.8%
WTA 33%                 3.37%     0.8%
WTA 10%                 4.63%     1.7%
WTA by Power 50%        4.60%     2.3%
Linear by SCA           2.72%     1.5%
Linear by power         5.77%     3.8%
Polynomial by SCA       3.58%     1.5%
Random                  8.66%     4.3%

Workload characterization

• Measured score at two frequencies
• Measured total CPU power
– Scaled power = (Workload Power)/(Max Power)
– Results 33%-100%
  • Leakage + idle is ~30%
– Most applications use less than 100% power
  • Even at Vmax, fmax they consume less than Imax
  • Reason: not all parts of the CPU are utilized
• Scalability = ΔPerf/ΔFrequency
– Result 0%-100%
  • Low: memory bound
  • High: CPU bound

SPEC int    Scaled power    Perf. scaling with freq.    FIFO impact
gzip        48%             0.95                        0.13%
vpr         44%             0.68                        2.92%
gcc         35%             0.67                        0.92%
mcf         49%             0.30                        2.92%
crafty      33%             0.99                        0.59%
parser      60%             0.78                        1.29%
eon         42%             0.99                        0.00%
perlbmk     50%             1.00                        0.31%
gap         45%             0.56                        1.14%
vortex      60%             0.73                        1.45%
bzip2       49%             0.70                        0.71%
twolf       97%             0.99                        4.68%
Int_rate    51%             0.77                        1.42%
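The two metrics defined on this slide can be written out as small functions. This is a sketch under one explicit assumption: ΔPerf and ΔFrequency are taken as relative changes between the two measured frequencies, so that a fully CPU bound application scores 1.0. The function names and the sample numbers in the usage lines are hypothetical.

```python
def scaled_power(workload_power, max_power):
    """Scaled power = (workload power) / (max power), as a fraction."""
    return workload_power / max_power


def scalability(perf_lo, perf_hi, freq_lo, freq_hi):
    """Scalability = relative performance change / relative frequency change.

    Near 0: memory bound (performance barely follows frequency).
    Near 1: CPU bound (performance tracks frequency).
    """
    d_perf = (perf_hi - perf_lo) / perf_lo
    d_freq = (freq_hi - freq_lo) / freq_lo
    return d_perf / d_freq


# Hypothetical measurements: score 100 at 1.0 GHz and 130 at 2.0 GHz.
print(scalability(100.0, 130.0, 1.0, 2.0))  # low value: memory bound
print(scaled_power(33.0, 100.0))
```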

Workload characterization

• Used cycle accurate simulation to evaluate FIFO impact / application

• All studies are averaged over the entire run, not accounting for variance over time
• The study also applies to phases in a workload

Some DVFS model details

• All models are built with relative values and not absolute voltages, frequency or performance
• From min Vcc: linear scaling of frequency only

[Figure: Leakage vs. Voltage; x-axis: Vcc [relative], 0.60 to 1.05; series: Leakage, X^3 approximation]

[Figure: Power to frequency; x-axis: Power [%], 0% to 120%; fitted power trendline: y = 1.0102 x^0.4414, R^2 = 0.9986]

[Figure: Frequency as a function of V_gate; x-axis: Voltage [V], 0.60 to 1.20; y-axis: Freq [GHz]; series: Linear freq, Actual freq.]
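The two curve fits named in these charts can be evaluated directly. This is a sketch with assumed units: x in the power trendline is taken as power relative to maximum (1.0 = 100%) and y as relative frequency, and leakage is taken relative to its value at Vcc = 1.0. The function names are hypothetical.

```python
def freq_from_power(power_fraction):
    """Power trendline from the chart: y = 1.0102 * x^0.4414."""
    return 1.0102 * power_fraction ** 0.4414


def leakage_cubic(vcc_relative):
    """Cubic approximation of leakage vs. relative Vcc (X^3)."""
    return vcc_relative ** 3


for p in (0.5, 0.8, 1.0):
    print(f"power {p:.0%} -> relative frequency {freq_from_power(p):.3f}")
for v in (0.6, 0.8, 1.0):
    print(f"Vcc {v:.2f} -> relative leakage {leakage_cubic(v):.3f}")
```

The exponent below 0.5 means frequency grows much more slowly than power: extra power headroom buys a comparatively small frequency gain.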

Workload characteristics – few observations

• Application power is distributed around ~60% of max power
– Min 33%: leakage + idle power
– Very few apps reach 100%
• Scalability is evenly distributed
• No correlation found between power and scalability
– OOO characteristics
– Simpler core is expected to show positive correlation
• Random pick of 16 cores:
– Tighter overall power distribution
– Very low probability for all applications to be high or low power

[Figure: Application power distribution; x-axis: power, 0% to 120%; y-axis: application count; series: apps power distribution, normal distribution]

[Figure: Performance scaling score vs. power; x-axis: Power [% of max], 0% to 120%]
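The observation that a random pick of 16 applications gives a tighter overall power distribution can be illustrated with a small simulation. This sketch uses the scaled-power column of the SPEC int table as the pool and samples with replacement; both choices are assumptions for illustration and not the study's actual data set.

```python
import random
import statistics

# Scaled power per application, from the SPEC int characterization table.
SCALED_POWER = [0.48, 0.44, 0.35, 0.49, 0.33, 0.60,
                0.42, 0.50, 0.45, 0.60, 0.49, 0.97]


def power_spread(n_cores=16, n_trials=2000, seed=0):
    """Return (spread across single applications,
               spread of the average power of random n_cores picks)."""
    rng = random.Random(seed)
    chip_means = [statistics.mean(rng.choices(SCALED_POWER, k=n_cores))
                  for _ in range(n_trials)]
    return statistics.pstdev(SCALED_POWER), statistics.pstdev(chip_means)


per_app, per_chip = power_spread()
print(f"std dev per application: {per_app:.3f}")
print(f"std dev of 16-core average: {per_chip:.3f}")
```

Averaging over 16 independent picks shrinks the spread, which is why a workload made entirely of high power or entirely of low power applications is unlikely.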

Why is VR a constraint? Physics

[Figure: power delivery components: battery, controller, drivers, inductors, bulk capacitors. Annotation: need close proximity]

Overview

• How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
• Contributions:
– Power delivery constrains DVFS
  • Multi-voltage domains not so easy
– Methodology to evaluate CMP workloads
– Clustered voltage and clock domains

Work Overview - scope

• 16 core power constrained CMP
• 1 through 16 voltage regulators (VR) and clock domains
– Either on chip or off chip VR
• Independent clock domains require a FIFO buffer: increased latency
• Best topology? Optimal policy? Under constraints
