Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors
and Implications on Asymmetric Multiple Core Processors
Avshalom Elyada
DVS, Avshalom Elyada, EE Faculty, Technion
Based primarily on the work of
Greg Semeraro, David H. Albonesi et al.,
University of Rochester, NY, and also
Diana Marculescu et al.,
Carnegie Mellon University, PA.
Outline
• Multiple Clock Domains
• Inter-domain communication and synchronization
• Dynamic Frequency-Voltage Scaling
• Scaling algorithms
  – Offline, Attack-Decay, Dynamic Profiling
  – Results comparison
• DVS in Multiple Core Processors...?
End of the Road for Globally-Synchronous
• A global high-frequency clock does not scale well
  – Low clock reachability within a single clock cycle
• Interconnect does not scale well
• Clock-tree complexity, skew, power-inefficiency
Multiple Clock Domains, or Globally Asynchronous Locally Synchronous
• Divide the core into separate clock domains
• Synchronize communication between synchronous “islands”
• Speed up the frequency of the separate, smaller domains
• Good inter-domain communication design
  – To minimize synchronization performance costs
• Retain the traditional synchronous knowledge-base
MCD Processor (Alpha 21264-like Model, Rochester DVS research)
Multi-Synchronous
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
Dynamic Frequency-Voltage Scaling
• If all domains always run at max frequency, power is usually wasted
• Only the critical domain needs to run at max frequency; the others can run slower
• This saves power
• Performance degradation should be minimal
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
• GALS (Globally Async, Locally Sync; MCD): asynchronous domains, a different frequency per domain
Integer Dominated
Load-Store Dominated
D(F)VS Continued
• 20-40% energy-delay improvement
• Voltage scales down with frequency, saving additional power
  – Potential for cubic (f³) savings:
  – Power ∝ f · V_DD²,  Delay ∝ 1/f
  – If V_DD ∝ f, then Power ∝ f³
• Careful: wrong scaling is catastrophic for performance
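These relations can be sanity-checked numerically; a minimal Python sketch (function names are mine, not from the slides):

```python
# Sketch of the scaling relations above:
# Power ∝ f * Vdd^2, Delay ∝ 1/f; if Vdd scales linearly with f,
# Power ∝ f^3 and Energy = Power * Delay ∝ f^2.

def relative_power(f_ratio, scale_voltage=True):
    """Power relative to running at full frequency (f_ratio = f / f_max)."""
    v_ratio = f_ratio if scale_voltage else 1.0
    return f_ratio * v_ratio ** 2

def relative_energy(f_ratio, scale_voltage=True):
    """Energy = Power * Delay, with Delay scaling as 1 / f_ratio."""
    return relative_power(f_ratio, scale_voltage) / f_ratio

print(relative_power(0.5))           # 0.125: cubic power reduction
print(relative_energy(0.5))          # 0.25:  quadratic energy reduction
print(relative_energy(0.5, False))   # 1.0:   no energy saved without V scaling
```

At half frequency with scaled voltage, power drops 8x and energy 4x; without voltage scaling, energy per unit of work is unchanged, which is why frequency scaling alone is a poor power-saving tool.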
Scaling is Gradual and Occurs During Regular Operation
• F may be decreased before V is decreased
• V must be increased before F may be increased
[Figure: F-V working points; voltage 1.000 V - 1.172 V vs. frequency 727.3 - 729.6 MHz]
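The ordering rule above can be sketched as a tiny controller; the Device class and its interface are hypothetical, invented only to show the ordering:

```python
# Hypothetical device model; the point is only the ordering of the two calls:
# raise V before F when speeding up, lower F before V when slowing down.
class Device:
    def __init__(self, freq, voltage):
        self.freq, self.voltage = freq, voltage
        self.log = []                    # records the order of transitions
    def set_voltage(self, v):
        self.voltage = v
        self.log.append(("V", v))
    def set_freq(self, f):
        self.freq = f
        self.log.append(("F", f))

def set_working_point(dev, new_f, new_v):
    if new_f > dev.freq:                 # speeding up: V must rise before F
        dev.set_voltage(new_v)
        dev.set_freq(new_f)
    else:                                # slowing down: F may fall before V
        dev.set_freq(new_f)
        dev.set_voltage(new_v)

dev = Device(freq=727.3, voltage=1.000)
set_working_point(dev, 729.6, 1.172)
print(dev.log)  # [('V', 1.172), ('F', 729.6)]
```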
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, at the same frequency
• GALS (MCD): asynchronous domains, a different frequency per domain, autonomous
• DVS (C-GALS): a different frequency per domain, centrally controlled
Configuration Parameters (XScale-like)
• 320 frequency-voltage working points
• Frequency range: 250-1000 MHz
• Voltage range: 0.65-1.20 V
  – Step between working points: 1.72 mV / 2.34 MHz
  – Change rate: 0.172 µs per step (55 µs end-to-end)
• Time step: change every 50K cycles
DVS per Domain - Considerations
• Scaling algorithm:
  – Determine the F-V point of each domain at any time
  – Temporal granularity: how often to change the F-V point
• Synchronization
  – Multi-Sync: all domains run at the same frequency
    • Simple sync solutions exist (phase compensation)
  – GALS: different and changing frequencies
    • Asynchronous synchronization solutions impede performance
    • Or think of better solutions…
Power-Bounded DVS
• Given a power envelope
• Mobilize energy between domains to attain maximum performance
[Figure: energy share per domain (Integer, Front-end, Floating-point, External Memory, Memory) across time-steps, 0-100%]
Scaling Algorithm
• Input: a serial program
• Output: a parallel, temporal specification of which domains are slowed by how much
• Temporal granularity
  – The time-step should be short enough to be dynamic
  – Too short is ineffective due to:
    • Gradual scaling
    • The overhead of the change
Scaling Algorithms
• ‘Offline’ algorithm
  – Full preparation on a simulator
  – Insert F-V configuration instructions for the actual run
• ‘Online’ (Attack-Decay)
  – Done entirely in hardware
  – Rescale F-V according to internal queue levels
• Dynamic Profiling
  – Short profile run; find program phases
  – Rescale F-V on phase transitions
Offline Algorithm
• Run the program on a simulator at max speed; trace Primitive Events
  – Primitive event = work performed in a single domain on behalf of a single instruction
• Construct a Directed Acyclic Graph
  – Functional and data dependencies between primitive events
  – Arcs represent time between events
Offline Algorithm Contd.
• Slack appears on non-critical paths
• Stretch events that are not on the critical time path (stretch the slack)
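The slack idea can be sketched on a toy event DAG; this is my own minimal formulation (earliest/latest finish times), not the paper's exact algorithm:

```python
# Slack = latest finish - earliest finish; events with positive slack can be
# stretched (run at lower frequency) without lengthening total execution.
def event_slacks(duration, preds):
    """duration: {event: time}; preds: {event: [predecessor events]} (acyclic)."""
    earliest = {}
    def finish(e):                       # earliest finish via recursive relaxation
        if e not in earliest:
            start = max((finish(p) for p in preds.get(e, [])), default=0)
            earliest[e] = start + duration[e]
        return earliest[e]
    total = max(finish(e) for e in duration)
    latest = {e: total for e in duration}
    for e in sorted(duration, key=finish, reverse=True):
        for p in preds.get(e, []):       # predecessor must finish before e starts
            latest[p] = min(latest[p], latest[e] - duration[e])
    return {e: latest[e] - earliest[e] for e in duration}

# "fetch" -> "exec" is the critical path; the independent "fp" event has slack.
print(event_slacks({"fetch": 2, "exec": 3, "fp": 1}, {"exec": ["fetch"]}))
# {'fetch': 0, 'exec': 0, 'fp': 4}
```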
Offline Algorithm Contd.
• Now we have the desired scale-down of single primitive events
• Need to scale down domains per time-step
  – Construct event histograms per domain per time-step: H(domain, time-step)
  – Assign a tolerable performance degradation p%
  – Determine the actual per-domain scale-down according to (H, p)
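One plausible reading of the (H, p) step, as a sketch; the quantile rule below is my assumption, not the paper's exact formula:

```python
# Per domain and time-step: each primitive event tolerates some frequency
# (normalized, 1.0 = f_max). Pick the lowest frequency that slows at most a
# fraction p of the events below what they can tolerate.
def pick_frequency(tolerable_freqs, p):
    ranked = sorted(tolerable_freqs, reverse=True)   # most demanding events first
    cutoff = int(len(ranked) * p)                    # events we accept to penalize
    return ranked[min(cutoff, len(ranked) - 1)]

hist = [1.0, 0.9, 0.6, 0.5, 0.5]     # tolerable frequency per event in one domain
print(pick_frequency(hist, p=0.0))   # 1.0: no degradation allowed
print(pick_frequency(hist, p=0.2))   # 0.9: only the most demanding event is slowed
```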
Online Algorithm
• Each time step, sample input queue levels
  – Attack: if the queue level is up by ~2%, increase frequency by 6%
  – Decay: if the level is unchanged, decrease frequency by ~0.2%
• Simple, HW only; results are ~70% of offline
• Watch out for perturbations, local minima, over-activism and other feedback-related pitfalls
[Figure: frequency tracking queue fill level]
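A minimal attack-decay controller sketch using the percentages on the slide; the steady-state threshold and the clamp range are my assumptions:

```python
# Attack: queue level rising -> jump frequency up 6%.
# Decay: queue level roughly steady -> let frequency drift down 0.2% per step.
def next_freq(freq, queue_level, prev_level, f_min=0.25, f_max=1.0):
    delta = queue_level - prev_level
    if delta > 0.02:                 # attack: input queue is filling up
        freq *= 1.06
    elif abs(delta) < 0.005:         # decay: queue roughly unchanged
        freq *= 0.998
    return max(f_min, min(f_max, freq))

f = 0.8
f = next_freq(f, queue_level=0.50, prev_level=0.45)  # attack step
f = next_freq(f, queue_level=0.50, prev_level=0.50)  # decay step
print(round(f, 4))  # 0.8463
```

The asymmetry (big attack steps, tiny decay steps) is what keeps the queue from draining under a sudden load spike while still reclaiming power slowly during quiet phases.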
Dynamic Profiling
• Execution shows repeating program phases
  – A phase is often delimited by a subroutine call or loop
• Dynamic Profiling:
  – Identify phases with a short profiling run
  – Insert phase marks and F-V configuration into the program
  – When the program reaches a mark, reconfigure F-V
Results Comparison
Improved Dynamic Profiling
• Each program carries its phase information as initial setup data
  – Assuming phase info is not processor-specific
  – Alternatively, processor-specific compilation
• Or, the processor itself performs the profile run
  – HW-based dynamic profiling, eliminating the need for a simulation pre-run
DVS in ACCMP
• Conceptual difference:
  – MCD processor: sub-units run at different frequencies
  – MCP: threads run at different frequencies
• ACCMP: different-size cores
• ACCMP with DVS: cores also dynamically change frequency
DVS - Degree of Freedom
• ACCMP
  – Allocate each thread to a processor of static strength: S, M, or L performance
• ACCMP with DVS
  – Scale the processor to the thread's performance needs (“stretch-fit”)
  – Dynamically accommodate overlapping performance ranges (e.g., 32-38, 36-44, 40-50)
Dynamic Thread Allocation
[Figure: power vs. performance curves for Large, Medium and Small DVS processors]
• 3 sizes of DVS processors
• A thread “wants” performance between the M and L processors
• Allocate to M only: hurts performance, but still better than static ACCMP
• Allocate to L only: wastes power
• Or migrate between both, according to performance needs
• What is best?
Migration
• k migrations between the M and L processors
• Phases φ_M, φ_L on each of the processors
• Energy = k·E_mig + ∫_{T_M} Pwr_M(f_M) dt + ∫_{T_L} Pwr_L(f_L) dt
         = k·E_mig + Pwr_M(f_M)·T_M(f_M) + Pwr_L(f_L)·T_L(f_L)
• Delay = k·D_mig + T_M(f_M) + T_L(f_L)
• Choose k such that Energy · Delay → min
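The Energy·Delay minimization over k can be sketched numerically; the runtime model and every constant below are invented purely for illustration:

```python
# energy_delay mirrors the Energy and Delay expressions for k migrations;
# runtime(k) is a toy model in which more migrations let the thread spend more
# of its work on the power-efficient M core without missing its target.
def energy_delay(k, e_mig, d_mig, runtime):
    pwr_m, t_m, pwr_l, t_l = runtime(k)
    energy = k * e_mig + pwr_m * t_m + pwr_l * t_l
    delay = k * d_mig + t_m + t_l
    return energy * delay

def runtime(k):                        # pure assumption, for illustration only
    frac_m = min(0.5 + 0.05 * k, 0.9)  # share of work the M core absorbs
    t_m, t_l = 10.0 * frac_m, 10.0 * (1.0 - frac_m)
    return 1.0, t_m, 3.0, t_l          # (Pwr_M, T_M, Pwr_L, T_L)

best_k = min(range(16), key=lambda k: energy_delay(k, 0.2, 0.05, runtime))
print(best_k)  # 8: beyond this, extra migrations only add E_mig/D_mig overhead
```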
The End
DVS in Multiple Core Processors
• Asymmetric cores
  – Asymmetric-size cores have been suggested to better utilize die area when there are too few threads
    • But research shows symmetric cores perform better when there are enough threads
  – With DVS, a core's performance dynamically varies according to frequency
• Viewed in a performance/energy metric, this is a more flexible kind of asymmetry…
• Also simplifies the SW decision of which thread to assign to which asymmetric core
Inter-Domain Communication
• To minimize the synchronization penalty
  – Divide the area into domains where a dual-port queue structure inherently exists
    • Dual-port FIFO synchronization solution
  – Otherwise, divide where inter-domain communication is minimal
[Figure: dual-port FIFO synchronizer between producer and consumer domains; write port: wclk, wen, wdata, full; read port: rclk, ren, rdata, empty]
Dual-Port FIFO
• Producer/consumer domains can write/read independently as long as the FIFO is neither full nor empty
• Full and Empty are the only signals that need syncing
• Therefore, a sync penalty is incurred only when the FIFO is full or empty
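The dual-port FIFO behavior can be sketched as a software model. This is behavioral only: in real hardware full/empty are derived from Gray-coded pointers synchronized across the two clock domains, so the single shared counter below is a modeling shortcut:

```python
# Behavioral model of the FIFO above: the producer blocks only on `full`, the
# consumer only on `empty`; those two flags are the only cross-domain signals.
class DualPortFifo:
    def __init__(self, depth):
        self.buf = [None] * depth
        self.wptr = self.rptr = self.count = 0   # count: modeling shortcut
    def full(self):
        return self.count == len(self.buf)
    def empty(self):
        return self.count == 0
    def write(self, item):               # producer-domain port
        if self.full():
            return False                 # the only case with a sync penalty
        self.buf[self.wptr] = item
        self.wptr = (self.wptr + 1) % len(self.buf)
        self.count += 1
        return True
    def read(self):                      # consumer-domain port
        if self.empty():
            return None                  # the only case with a sync penalty
        item = self.buf[self.rptr]
        self.rptr = (self.rptr + 1) % len(self.buf)
        self.count -= 1
        return item

fifo = DualPortFifo(depth=2)
print(fifo.write("a"), fifo.write("b"), fifo.write("c"))  # True True False
print(fifo.read(), fifo.read(), fifo.read())              # a b None
```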
Syncing Periodic Domains
• Synchronization solutions that exploit no knowledge of clock relations are sub-optimal
  – Examples: two-flop synchronizers, and even the dual-port FIFO
• With DVS, clock relations are periodic, dynamic, and known
  – A predictive synchronizer can predict when a conflict will occur between different periodic clocks
    • But conflict prediction sometimes adapts slowly to frequency changes
  – DVS makes it possible to exploit the fact that the domain frequencies are known
• Proposal: a multi-frequency synchronizer that can detect conflicts by knowing at which frequencies its producer and consumer run
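A toy sketch of the prediction idea: with both periods known, the receiver can compute in advance which of its edges fall too close to a sender edge. The periods, window and units below are illustrative only:

```python
# Receiver clock edges occur at multiples of t_recv; flag any edge within
# `window` of a sender edge (multiples of t_send) as unsafe to sample.
def unsafe_edges(t_send, t_recv, window, horizon):
    unsafe = []
    t = 0.0
    while t < horizon:
        nearest = round(t / t_send) * t_send   # closest sender edge to this edge
        if abs(t - nearest) < window:
            unsafe.append(t)
        t += t_recv
    return unsafe

# Periods 10 and 15 time units, 2-unit conflict window:
print(unsafe_edges(t_send=10, t_recv=15, window=2, horizon=60))  # [0.0, 30.0]
```

Because both periods are known under DVS, this table can be recomputed whenever a domain changes frequency, instead of paying a two-flop latency on every transfer.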
Gradual Scaling
• The device works throughout the change
• Necessary for two reasons:
  – The online algorithm is based on steadily-changing feedback control
  – (?) Synchronizers can't cope with a step change
• Using dynamic profiling + adequate synchronizers, instant scaling is possible