Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors
and Implications on Asymmetric Multiple Core Processors
Avshalom Elyada
DVS, Avshalom Elyada, EE Faculty, Technion
Based primarily on the work of
Greg Semeraro, David H. Albonesi et al.,
University of Rochester, NY, and also
Diana Marculescu et al.,
Carnegie Mellon University, PA.
Outline
• Multiple Clock Domains
• Inter-domain communication and synchronization
• Dynamic Frequency-Voltage Scaling
• Scaling algorithms
  – Offline, Attack-Decay, Dynamic Profiling
  – Results comparison
• DVS in Multiple Core Processors...?
End of the Road for Globally-Synchronous
• A global high-frequency clock does not scale well
  – Low clock reachability within a single clock cycle
• Interconnect does not scale well
• Clock-tree complexity, skew, power-inefficiency
Multiple Clock Domains, or Globally Asynchronous Locally Synchronous
• Divide the core into separate clock domains
• Synchronize communication between synchronous “islands”
• Speed up the frequency of the separate, smaller domains
• Good inter-domain communication design
  – To minimize synchronization performance costs
• Retain the traditional synchronous knowledge-base
MCD Processor (Alpha 21264-like Model, Rochester DVS research)
Multi-Synchronous
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
Dynamic Frequency-Voltage Scaling
• If all domains always run at max frequency, power is usually wasted
• Only the critical domain needs to run at max frequency; the others can run slower
• This saves power
• Performance degradation should be minimal
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
• GALS (Globally Async, Locally Sync; MCD): asynchronous domains, a different frequency per domain
Integer Dominated
Load-Store Dominated
D(F)VS Continued
• 20-40% energy-delay improvement
• Voltage scales down with frequency, saving additional power
  – Potential for cubic (f³) savings:
  – Power ∝ f · V_DD²,  Delay ∝ 1/f
  – If V_DD ∝ f, then Power ∝ f³
• Careful: wrong scaling is catastrophic for performance
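These relations can be sanity-checked numerically; a minimal Python sketch (function names are mine, not from the slides):

```python
# Sketch of the scaling relations above:
# Power ∝ f * Vdd^2, Delay ∝ 1/f; if Vdd scales linearly with f,
# Power ∝ f^3 and Energy = Power * Delay ∝ f^2.

def relative_power(f_ratio, scale_voltage=True):
    """Power relative to running at full frequency (f_ratio = f / f_max)."""
    v_ratio = f_ratio if scale_voltage else 1.0
    return f_ratio * v_ratio ** 2

def relative_energy(f_ratio, scale_voltage=True):
    """Energy = Power * Delay, with Delay scaling as 1 / f_ratio."""
    return relative_power(f_ratio, scale_voltage) / f_ratio

print(relative_power(0.5))           # 0.125: cubic power reduction
print(relative_energy(0.5))          # 0.25:  quadratic energy reduction
print(relative_energy(0.5, False))   # 1.0:   no energy saved without V scaling
```

At half frequency with scaled voltage, power drops 8x and energy 4x; without voltage scaling, energy per unit of work is unchanged, which is why frequency scaling alone is a poor power-saving tool.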
Scaling is Gradual and Occurs During Regular Operation
• F may be decreased before V is decreased
• V must be increased before F may be increased
[Figure: F-V working points; voltage 1.000 V - 1.172 V vs. frequency 727.3 - 729.6 MHz]
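The ordering rule above can be sketched as a tiny controller; the Device class and its interface are hypothetical, invented only to show the ordering:

```python
# Hypothetical device model; the point is only the ordering of the two calls:
# raise V before F when speeding up, lower F before V when slowing down.
class Device:
    def __init__(self, freq, voltage):
        self.freq, self.voltage = freq, voltage
        self.log = []                    # records the order of transitions
    def set_voltage(self, v):
        self.voltage = v
        self.log.append(("V", v))
    def set_freq(self, f):
        self.freq = f
        self.log.append(("F", f))

def set_working_point(dev, new_f, new_v):
    if new_f > dev.freq:                 # speeding up: V must rise before F
        dev.set_voltage(new_v)
        dev.set_freq(new_f)
    else:                                # slowing down: F may fall before V
        dev.set_freq(new_f)
        dev.set_voltage(new_v)

dev = Device(freq=727.3, voltage=1.000)
set_working_point(dev, 729.6, 1.172)
print(dev.log)  # [('V', 1.172), ('F', 729.6)]
```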
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
• GALS (MCD): asynchronous domains, a different frequency per domain, autonomous
• DVS (C-GALS): a different frequency per domain, centrally controlled
Configuration Parameters (XScale-like)
• 320 frequency-voltage working points
• Frequency range: 250-1000 MHz
• Voltage range: 0.65-1.20 V
  – Step between working points: 1.72 mV / 2.34 MHz
  – Change rate: 0.172 µs per step (55 µs end-to-end)
• Time step: change every 50K cycles
DVS per Domain - Considerations
• Scaling algorithm:
  – Determine the F-V point of each domain at any time
  – Temporal granularity: how often to change the F-V point
• Synchronization
  – Multi-Sync: all domains run at the same frequency
    • Simple sync solutions exist (phase compensation)
  – GALS: different and changing frequencies
    • Asynchronous synchronization solutions impede performance
    • Or think of better solutions…
Power-Bounded DVS
• Given a power envelope
• Mobilize energy between domains to attain maximum performance
[Figure: energy share per domain (Integer, Front-end, Floating-point, External Memory, Memory) across time-steps, 0-100%]
Scaling Algorithm
• Input: a serial program
• Output: a parallel, temporal specification of which domains are slowed by how much
• Temporal granularity
  – The time-step should be short enough to be dynamic
  – Too short is ineffective due to:
    • Gradual scaling
    • The overhead of the change
Scaling Algorithms
• ‘Offline’ algorithm
  – Full preparation on a simulator
  – Insert F-V configuration instructions for the actual run
• ‘Online’ (Attack-Decay)
  – Done entirely in hardware
  – Rescale F-V according to internal queue levels
• Dynamic Profiling
  – Short profile run; find program phases
  – Rescale F-V on phase transitions
Offline Algorithm
• Run the program on a simulator at max speed; trace Primitive Events
  – Primitive event = work performed in a single domain on behalf of a single instruction
• Construct a Directed Acyclic Graph
  – Functional and data dependencies between primitive events
  – Arcs represent time between events
Offline Algorithm Contd.
• Slack appears on non-critical paths
• Stretch events that are not on the critical time path (stretch the slack)
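The slack idea can be sketched on a toy event DAG; this is my own minimal formulation (earliest/latest finish times), not the paper's exact algorithm:

```python
# Slack = latest finish - earliest finish; events with positive slack can be
# stretched (run at lower frequency) without lengthening total execution.
def event_slacks(duration, preds):
    """duration: {event: time}; preds: {event: [predecessor events]} (acyclic)."""
    earliest = {}
    def finish(e):                       # earliest finish via recursive relaxation
        if e not in earliest:
            start = max((finish(p) for p in preds.get(e, [])), default=0)
            earliest[e] = start + duration[e]
        return earliest[e]
    total = max(finish(e) for e in duration)
    latest = {e: total for e in duration}
    for e in sorted(duration, key=finish, reverse=True):
        for p in preds.get(e, []):       # predecessor must finish before e starts
            latest[p] = min(latest[p], latest[e] - duration[e])
    return {e: latest[e] - earliest[e] for e in duration}

# "fetch" -> "exec" is the critical path; the independent "fp" event has slack.
print(event_slacks({"fetch": 2, "exec": 3, "fp": 1}, {"exec": ["fetch"]}))
# {'fetch': 0, 'exec': 0, 'fp': 4}
```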
Offline Algorithm Contd.
• Now we have the desired scale-down of single primitive events
• Need to scale down domains per time-step
  – Construct event histograms per domain per time-step: H(domain, time-step)
  – Assign a tolerable performance degradation p%
  – Determine the actual per-domain scale-down according to (H, p)
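One plausible reading of the (H, p) step, as a sketch; the quantile rule below is my assumption, not the paper's exact formula:

```python
# Per domain and time-step: each primitive event tolerates some frequency
# (normalized, 1.0 = f_max). Pick the lowest frequency that slows at most a
# fraction p of the events below what they can tolerate.
def pick_frequency(tolerable_freqs, p):
    ranked = sorted(tolerable_freqs, reverse=True)   # most demanding events first
    cutoff = int(len(ranked) * p)                    # events we accept to penalize
    return ranked[min(cutoff, len(ranked) - 1)]

hist = [1.0, 0.9, 0.6, 0.5, 0.5]     # tolerable frequency per event in one domain
print(pick_frequency(hist, p=0.0))   # 1.0: no degradation allowed
print(pick_frequency(hist, p=0.2))   # 0.9: only the most demanding event is slowed
```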
Online Algorithm
• Each time step, sample input queue levels
  – Attack: if the queue level is up by ~2%, increase frequency by 6%
  – Decay: if the level is unchanged, decrease frequency by ~0.2%
• Simple, HW only; results are ~70% of offline
• Watch out for perturbations, local minima, over-activism and other feedback-related pitfalls
[Figure: frequency tracking queue fill level]
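A minimal attack-decay controller sketch using the percentages on the slide; the steady-state threshold and the clamp range are my assumptions:

```python
# Attack: queue level rising -> jump frequency up 6%.
# Decay: queue level roughly steady -> let frequency drift down 0.2% per step.
def next_freq(freq, queue_level, prev_level, f_min=0.25, f_max=1.0):
    delta = queue_level - prev_level
    if delta > 0.02:                 # attack: input queue is filling up
        freq *= 1.06
    elif abs(delta) < 0.005:         # decay: queue roughly unchanged
        freq *= 0.998
    return max(f_min, min(f_max, freq))

f = 0.8
f = next_freq(f, queue_level=0.50, prev_level=0.45)  # attack step
f = next_freq(f, queue_level=0.50, prev_level=0.50)  # decay step
print(round(f, 4))  # 0.8463
```

The asymmetry (big attack steps, tiny decay steps) is what keeps the queue from draining under a sudden load spike while still reclaiming power slowly during quiet phases.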
Dynamic Profiling
• Execution shows repeating program phases
  – A phase is often delimited by a subroutine call or loop
• Dynamic Profiling:
  – Identify phases with a short profiling run
  – Insert phase marks and F-V configuration into the program
  – When the program reaches a mark, reconfigure F-V
Results Comparison
Improved Dynamic Profiling
• Each program carries its phase information as initial setup data
  – Assuming phase info is not processor-specific
  – Alternatively, processor-specific compilation
• Or, the processor itself performs the profile run
  – HW-based dynamic profiling, eliminating the need for a simulation pre-run
DVS in ACCMP
• Conceptual difference:
  – MCD processor: sub-units run at different frequencies
  – MCP: threads run at different frequencies
• ACCMP: different-size cores
• ACCMP with DVS: cores also dynamically change frequency
DVS - Degree of Freedom
• ACCMP
  – Allocate each thread to a processor of static strength: S, M, or L performance
• ACCMP with DVS
  – Scale the processor to the thread's performance needs (“stretch-fit”)
  – Dynamically accommodate overlapping performance ranges (e.g., 32-38, 36-44, 40-50)
Dynamic Thread Allocation
[Figure: power vs. performance curves for Large, Medium and Small DVS processors]
• 3 sizes of DVS processors
• A thread “wants” performance between the M and L processors
• Allocate to M only: hurts performance, but still better than static ACCMP
• Allocate to L only: wastes power
• Or migrate between both, according to performance needs
• What is best?
Migration
• k migrations between the M and L processors
• Phases φ_M, φ_L on each of the processors
• Energy = k·E_mig + ∫_{T_M} Pwr_M(f_M) dt + ∫_{T_L} Pwr_L(f_L) dt
         = k·E_mig + Pwr_M(f_M)·T_M(f_M) + Pwr_L(f_L)·T_L(f_L)
• Delay = k·D_mig + T_M(f_M) + T_L(f_L)
• Choose k such that Energy · Delay → min
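The Energy·Delay minimization over k can be sketched numerically; the runtime model and every constant below are invented purely for illustration:

```python
# energy_delay mirrors the Energy and Delay expressions for k migrations;
# runtime(k) is a toy model in which more migrations let the thread spend more
# of its work on the power-efficient M core without missing its target.
def energy_delay(k, e_mig, d_mig, runtime):
    pwr_m, t_m, pwr_l, t_l = runtime(k)
    energy = k * e_mig + pwr_m * t_m + pwr_l * t_l
    delay = k * d_mig + t_m + t_l
    return energy * delay

def runtime(k):                        # pure assumption, for illustration only
    frac_m = min(0.5 + 0.05 * k, 0.9)  # share of work the M core absorbs
    t_m, t_l = 10.0 * frac_m, 10.0 * (1.0 - frac_m)
    return 1.0, t_m, 3.0, t_l          # (Pwr_M, T_M, Pwr_L, T_L)

best_k = min(range(16), key=lambda k: energy_delay(k, 0.2, 0.05, runtime))
print(best_k)  # 8: beyond this, extra migrations only add E_mig/D_mig overhead
```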
The End
DVS in Multiple Core Processors
• Asymmetric cores
  – Asymmetric-size cores have been suggested to better utilize die area when there are too few threads
    • But research shows symmetric cores perform better when there are enough threads
  – With DVS, a core's performance dynamically varies according to frequency
• Viewed in a performance/energy metric, this is a more flexible kind of asymmetry…
• Also simplifies the SW decision of which thread to assign to which asymmetric core
Inter-Domain Communication
• To minimize the synchronization penalty
  – Divide the area into domains where a dual-port queue structure inherently exists
    • Dual-port FIFO synchronization solution
  – Otherwise, divide where inter-domain communication is minimal
[Figure: dual-port FIFO synchronizer between producer and consumer domains; write port: wclk, wen, wdata, full; read port: rclk, ren, rdata, empty]
Dual-Port FIFO
• Producer/consumer domains can write/read independently as long as the FIFO is neither full nor empty
• Full and Empty are the only signals that need syncing
• Therefore, a sync penalty is incurred only when the FIFO is full or empty
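The dual-port FIFO behavior can be sketched as a software model. This is behavioral only: in real hardware full/empty are derived from Gray-coded pointers synchronized across the two clock domains, so the single shared counter below is a modeling shortcut:

```python
# Behavioral model of the FIFO above: the producer blocks only on `full`, the
# consumer only on `empty`; those two flags are the only cross-domain signals.
class DualPortFifo:
    def __init__(self, depth):
        self.buf = [None] * depth
        self.wptr = self.rptr = self.count = 0   # count: modeling shortcut
    def full(self):
        return self.count == len(self.buf)
    def empty(self):
        return self.count == 0
    def write(self, item):               # producer-domain port
        if self.full():
            return False                 # the only case with a sync penalty
        self.buf[self.wptr] = item
        self.wptr = (self.wptr + 1) % len(self.buf)
        self.count += 1
        return True
    def read(self):                      # consumer-domain port
        if self.empty():
            return None                  # the only case with a sync penalty
        item = self.buf[self.rptr]
        self.rptr = (self.rptr + 1) % len(self.buf)
        self.count -= 1
        return item

fifo = DualPortFifo(depth=2)
print(fifo.write("a"), fifo.write("b"), fifo.write("c"))  # True True False
print(fifo.read(), fifo.read(), fifo.read())              # a b None
```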
Syncing Periodic Domains
• Synchronization solutions that exploit no knowledge of clock relations are sub-optimal
  – Examples: two-flop synchronizers, and even the dual-port FIFO
• With DVS, clock relations are periodic, dynamic, and known
  – A predictive synchronizer can predict when a conflict will occur between different periodic clocks
    • But conflict prediction sometimes adapts slowly to frequency changes
  – DVS makes it possible to exploit the fact that the domain frequencies are known
• Proposal: a multi-frequency synchronizer that can detect conflicts by knowing at which frequencies its producer and consumer run
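A toy sketch of the prediction idea: with both periods known, the receiver can compute in advance which of its edges fall too close to a sender edge. The periods, window and units below are illustrative only:

```python
# Receiver clock edges occur at multiples of t_recv; flag any edge within
# `window` of a sender edge (multiples of t_send) as unsafe to sample.
def unsafe_edges(t_send, t_recv, window, horizon):
    unsafe = []
    t = 0.0
    while t < horizon:
        nearest = round(t / t_send) * t_send   # closest sender edge to this edge
        if abs(t - nearest) < window:
            unsafe.append(t)
        t += t_recv
    return unsafe

# Periods 10 and 15 time units, 2-unit conflict window:
print(unsafe_edges(t_send=10, t_recv=15, window=2, horizon=60))  # [0.0, 30.0]
```

Because both periods are known under DVS, this table can be recomputed whenever a domain changes frequency, instead of paying a two-flop latency on every transfer.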
Gradual Scaling
• The device works throughout the change
• Necessary for two reasons:
  – The online algorithm is based on steadily-changing feedback control
  – (?) Synchronizers can't cope with a step change
• Using dynamic profiling + adequate synchronizers, instant scaling is possible