The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez – Computer Systems Laboratory
Michael C. Huang – Electrical & Computer Engineering
The Thrifty Barrier – Li, Martínez, and Huang
Motivation
Multiprocessor architectures are sprouting everywhere
• large compute servers
• small servers, desktops
• chip multiprocessors
High energy consumption is a problem – more so in MPs
Most power-aware techniques are tailored to uniprocessors; multiprocessors present unique challenges
• processor coordination, synchronization
Case: Barrier Synchronization
Fast threads spin-wait for slower ones
Spin-waiting is wasteful by definition
• quick reaction
• but only the last iteration is useful
[Figure: per-thread timelines alternating compute and spin-wait phases]
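The spin-wait being criticized here can be sketched as a minimal sense-reversing barrier. This is an illustrative Python sketch (the class name and structure are ours, not the paper's); a real implementation would spin in machine code on a shared cache line:

```python
import threading

class SpinBarrier:
    """Minimal sense-reversing barrier: the last arriving thread
    flips the release flag; earlier arrivals spin-wait on it."""
    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.count = num_threads
        self.sense = False
        self.lock = threading.Lock()

    def wait(self):
        local_sense = not self.sense          # sense for this episode
        with self.lock:
            self.count -= 1
            last = (self.count == 0)
        if last:
            self.count = self.num_threads     # reset for next episode
            self.sense = local_sense          # release: flip the flag
        else:
            while self.sense != local_sense:  # spin-wait: the energy waste
                pass                          # targeted by the thrifty barrier
```

The spin loop reacts quickly to the release, but every iteration before the last one burns energy for no useful work.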
Proposal: Thrifty Barrier
Reduce spin-wait energy waste in barriers
• leverage existing processor sleep states (e.g., ACPI)
Minimize impact on execution time
• achieve timely wake-up
[Figure: conventional vs. thrifty barrier timelines]
Challenges
Should we sleep?
• transition times (sleep + wake-up) are non-negligible
Which sleep state?
• more energy savings → longer transition times
When to wake up?
• early w.r.t. barrier release → may hurt energy savings
• late w.r.t. barrier release → may hurt performance
Must predict barrier stall time accurately
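The first two questions reduce to a break-even check: sleep only if the predicted stall covers a state's round-trip transition, and pick the deepest state that still fits. A sketch with hypothetical state parameters (the names S1–S3 follow the slides' flowchart; the numbers are illustrative, not from the paper):

```python
# Hypothetical sleep states: (name, round-trip transition time in us,
# relative power while asleep).  Deeper states save more energy but
# take longer to enter and exit.
SLEEP_STATES = [
    ("S1", 10, 0.50),    # light sleep: fast transition, modest savings
    ("S2", 100, 0.20),   # deeper sleep
    ("S3", 1000, 0.05),  # deepest sleep: slow transition, big savings
]

def pick_sleep_state(predicted_stall_us):
    """Best fit: deepest state whose sleep+wake transition still fits
    within the predicted stall time; None means keep spinning."""
    best = None
    for name, transition, power in SLEEP_STATES:
        if transition <= predicted_stall_us:
            best = (name, transition, power)
    return best
```

With these illustrative parameters, a 5 µs predicted stall yields no sleep at all, while a 5 ms stall selects the deepest state.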
Findings
Many barrier stall times are large enough to leverage sleep states
Stall times are predictable
• discriminate through PC indexing
• predict indirectly using barrier interval times
Timely wake-up: a combination of two mechanisms
• a coherence message bounds wake-up latency
• a watchdog timer anticipates wake-up
Thrifty Barrier Mechanism
[Flowchart: barrier arrival → stall-time prediction → sleep decision; if yes, enter sleep state S1/S2/S3 until the wake-up signal, then residual spin; if no, spin until barrier departure]
Sleep Mechanism
[Flowchart repeated; this section focuses on the stall-time prediction and sleep stages]
Predicting Stall Time
Splash-2's FMM example: 3 important barriers, 4 iterations
• randomly picked thread (always the same)
PC indexing reduces variability
Interval time (BIT) is a more stable metric than stall time (BST)
Stall Time vs. Interval Time
Barriers separate computation phases
• PC indexing reduces variability
Barrier stall time (BST) varies considerably
• even with PC indexing
• barrier-dependent, but also thread-dependent
  – computation shifts among threads across invocations
Barrier interval time (BIT) varies much less
• quite stable if PC indexing is used
• barrier-dependent, but not thread-dependent
• last-value prediction is good enough for most applications
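Because BIT is stable across invocations of the same static barrier, the predictor can be as simple as a last-value table indexed by the barrier's PC. A minimal sketch (class and method names are ours):

```python
class BarrierIntervalPredictor:
    """Last-value predictor for barrier interval time (BIT),
    indexed by the barrier's program counter (PC).  Each static
    barrier gets its own entry, which is what "PC indexing" buys:
    intervals of different barriers never pollute each other."""
    def __init__(self):
        self.table = {}  # barrier PC -> last observed BIT

    def predict(self, barrier_pc):
        """Return the last observed BIT, or None before any update."""
        return self.table.get(barrier_pc)

    def update(self, barrier_pc, observed_bit):
        """Record the interval actually measured at release time."""
        self.table[barrier_pc] = observed_bit
```

In hardware this would be a small tagged table rather than a dictionary, but the policy is the same: predict the previous interval.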
Predicting Stall Time Indirectly
Can use BIT to predict BST indirectly
• compute time is measurable upon arrival at the barrier
• subtract it from the predicted BIT to derive the predicted BST
How to manage the time info?
[Figure: one barrier interval BIT decomposed into Compute_t followed by BST_t]
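The indirect derivation is a single subtraction; a sketch (the clamp at zero is our assumption for the case where a thread arrives later than the predicted release):

```python
def predicted_stall(predicted_bit, compute_time):
    """Indirect BST prediction: the thread measures its own compute
    time on arrival and subtracts it from the predicted interval
    (pBST = pBIT - Compute).  A negative difference means the thread
    arrived after the predicted release, so no stall is expected."""
    return max(0.0, predicted_bit - compute_time)
```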
Managing Time Info
Threads depart from barrier instance b-1 toward instance b
Each thread t keeps a local record of the release timestamp BRTS_{t,b-1}
Assumptions:
• no global clock
• local wallclock stays active even if the CPU sleeps
  – all CPUs run at the same nominal clock frequency
[Figure: timeline from barrier b-1 to b marking BRTS_{t,b-1}]
Managing Time Info
Thread t arrives, knowing BRTS_{t,b-1} and Compute_{t,b}
• make prediction pBIT_b
• derive pBST_{t,b} = pBIT_b – Compute_{t,b}
• use pBST_{t,b} to pick a sleep state (if warranted)
  – best fit based on transition time
[Figure: timeline from barrier b-1 to b marking Compute_{t,b}, pBIT_b, and pBST_{t,b}]
Managing Time Info
The last thread u arrives, knowing BRTS_{u,b-1}
• derive actual BIT_b = time() – BRTS_{u,b-1}
• update the (shared) predictor with BIT_b
• release the barrier
[Figure: timeline marking BRTS_{u,b-1} and the measured BIT_b]
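The last arriver's bookkeeping is one subtraction plus one predictor update; a sketch (function name ours; the shared predictor is modeled as a plain dictionary keyed by barrier PC, and the barrier release itself is elided):

```python
def on_last_arrival(predictor_table, barrier_pc, now, brts_prev):
    """Last thread u: derive the actual interval BIT_b from its own
    local release timestamp BRTS_{u,b-1}, publish it in the shared
    last-value predictor, and return it.  'now' is u's local clock
    reading at arrival, which is also the release time."""
    bit = now - brts_prev
    predictor_table[barrier_pc] = bit  # shared last-value predictor
    return bit
```

Note that u only ever subtracts two readings of its own local clock, consistent with the no-global-clock assumption.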
Managing Time Info
Every thread t (possibly after waking up late):
• read BIT_b from the updated predictor
• compute actual BRTS_{t,b} = BRTS_{t,b-1} + BIT_b
Threads never use timestamps (BRTS) from other threads
• no global clock is needed
[Figure: timeline marking BRTS_{t,b-1}, BIT_b, and the reconstructed BRTS_{t,b}]
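The reconstruction step is deliberately trivial: each thread advances its own local timestamp by the published interval, so timestamps from different local clocks are never mixed. A one-line sketch (function name ours):

```python
def next_release_timestamp(brts_prev, bit):
    """Each thread t reconstructs BRTS_{t,b} locally by adding the
    published BIT_b to its own previous timestamp BRTS_{t,b-1}.
    Only the interval (a duration) crosses threads, never an
    absolute timestamp, so no global clock is needed."""
    return brts_prev + bit
```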
Wake-up Mechanism
[Flowchart repeated; this section focuses on the wake-up-signal stage]
Wake-up Mechanism
Communicate barrier completion to sleeping CPUs
• signal sent to a CPU pin
• options: external vs. internal wake-up
External (passive): initiated by the processor that releases the barrier
• leverages the coherence protocol – invalidation of the spinlock
• must supply the spinlock address to the cache controller
Internal (active): triggered by a watchdog timer
• programmed with the predicted BST before going to sleep
Early vs. Late Wake-up
Early wake-up (underprediction)
• energy waste – residual spin
Late wake-up (overprediction)
• possible impact on execution time
External wake-up guarantees a late wake-up (but a bounded one)
Internal wake-up can lead to either (and late is not bounded)
Our approach: hybrid wake-up
• external provides an upper bound
• internal strives for timely wake-up using prediction
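The interaction of the two triggers can be captured as a timeline calculation. A sketch under our own modeling assumptions (times in arbitrary units; we assume the watchdog is programmed to fire one wake transition before the predicted release so the CPU is ready exactly on time):

```python
def hybrid_wakeup(sleep_start, predicted_stall, actual_release,
                  wake_transition):
    """Hybrid wake-up: the internal watchdog fires one wake
    transition before the predicted release (it anticipates); the
    external coherence invalidation at the actual release bounds
    lateness.  The CPU starts waking at whichever trigger comes
    first.  Returns (time CPU is ready, outcome)."""
    predicted_release = sleep_start + predicted_stall
    watchdog_fire = predicted_release - wake_transition  # internal
    trigger = min(watchdog_fire, actual_release)         # external bound
    ready = trigger + wake_transition
    if ready < actual_release:
        outcome = "early"   # residual spin: small energy waste
    elif ready == actual_release:
        outcome = "timely"
    else:
        outcome = "late"    # bounded by one wake transition
    return ready, outcome
```

With a perfect prediction the CPU is ready exactly at the release; an overprediction costs at most one wake transition of lateness, because the external invalidation always fires at the actual release.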
Other Considerations (see paper)
Sleep states that do not snoop coherence requests
• flush dirty data before sleeping
• defer invalidations of clean data
Overprediction threshold
• handles the case of frequent, swinging BITs of modest size
• turn off prediction if overprediction exceeds the threshold
Interaction with context switching and I/O
• underprediction threshold
Time-sharing issues: multiprogramming, overthreading
Experimental Setup
Simulated system: 64-node CC-NUMA
• 6-way dynamic superscalar
• L1: 16 KB, 64 B lines, 2-way, 2 clk; L2: 64 KB, 64 B lines, 8-way, 12 clk
• 16 B / 4 clk memory bus, 60 ns SDRAM
• hypercube network, wormhole routing, 4 clk pipelined routers
  – 16 clk pin to pin
Energy modeling: Wattch (CPU + L1 + L2)
• sleep states along the lines of the Pentium family
Experimental Setup
All Splash-2 applications except:
• Raytrace – no barriers
• LU – a better barrier-free version is widely available
Efficiency (64p): 40–82%, avg. 58%; target group ≥ 10%
Energy Savings
Performance Impact
Related Work Highlights
Quite a bit of work in the uniprocessor domain
Elnozahy et al.
• server farms, clusters
  – the thrifty barrier instead targets shared-memory parallel applications
Moshovos et al.; Saldanha and Lipasti
• energy-aware cache coherence
  – probably compatible with, and complementary to, the thrifty barrier
Conclusions
Energy-aware MP mechanisms can and should be pursued
A case study: energy-aware barrier synchronization
• simple indirect prediction of barrier stall time
• hybrid wake-up scheme to minimize impact on execution time
Encouraging results on target applications
• 17% avg. energy savings
• 2% avg. performance impact