The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez, Computer Systems Laboratory
Michael C. Huang, Electrical & Computer Engineering


The Thrifty Barrier – Li, Martínez, and Huang

Motivation

Multiprocessor architectures sprouting everywhere
• large compute servers
• small servers, desktops
• chip multiprocessors

High energy consumption a problem – more so in MPs

Most power-aware techniques tailored to uniprocessors

Multiprocessors present unique challenges
• processor co-ordination, synchronization


Case: Barrier Synchronization

Fast threads spin-wait for slower ones

Spin-wait wasteful by definition
• quick reaction
• but only last iteration useful

[Figure: thread timelines alternating compute and spin-wait phases]


Proposal: Thrifty Barrier

Reduce spin-wait energy waste in barriers
• leverage existing processor sleep states (e.g. ACPI)

Minimize impact on execution time
• achieve timely wake-up

[Figure: conventional vs. thrifty barrier]


Challenges

Should the CPU sleep?
• transition times (sleep + wake-up) non-negligible

Which sleep state?
• more energy savings → longer transition times

When to wake up?
• early w.r.t. barrier release → may hurt energy savings
• late w.r.t. barrier release → may hurt performance

Must predict barrier stall time accurately


Findings

Many barrier stall times large enough to leverage sleep states

Stall times predictable
• discriminate through PC indexing
• predict indirectly using barrier interval times

Timely wake-up: combination of two mechanisms
• coherence message bounds wake-up latency
• watchdog timer anticipates wake-up


Thrifty Barrier Mechanism

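[Figure: thrifty barrier flowchart. Barrier arrival → stall time prediction → sleep? → sleep state S1, S2, or S3 → wake-up signal → residual spin → barrier departure; the "No" branch skips the sleep states.]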


Sleep Mechanism


Predicting Stall Time

Splash-2’s FMM example: 3 important barriers, 4 iterations
• randomly picked thread (always the same)

PC indexing reduces variability

Interval time (BIT) more stable metric than stall time (BST)


Stall Time vs. Interval Time

Barriers separate computation phases
• PC indexing reduces variability

Barrier stall time (BST) varies considerably
• even with PC indexing
• barrier-, but also thread-dependent
  – computation shifts among threads across invocations

Barrier interval time (BIT) varies much less
• quite stable if PC indexing used
• barrier-, but not thread-dependent
• last-value prediction ok for most applications
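A minimal sketch of PC-indexed last-value prediction of BIT, as described above. The class name, the dictionary-based table, and the example PCs and cycle counts are illustrative assumptions, not the authors' hardware or software design.

```python
class LastValueBITPredictor:
    """Last-value predictor for barrier interval time (BIT), indexed by barrier PC."""

    def __init__(self):
        # barrier PC -> most recently observed BIT (hypothetical table layout)
        self._last_bit = {}

    def predict(self, barrier_pc):
        """Return the last observed BIT for this barrier, or None if never seen."""
        return self._last_bit.get(barrier_pc)

    def update(self, barrier_pc, observed_bit):
        """Record the actual BIT measured when the barrier was released."""
        self._last_bit[barrier_pc] = observed_bit


# Hypothetical PCs and cycle counts, for illustration only.
predictor = LastValueBITPredictor()
predictor.update(0x4008F0, 120_000)
predictor.update(0x400A24, 15_000)
predictor.update(0x4008F0, 118_500)  # same barrier, next invocation
```

Indexing by PC keeps distinct barriers in the code from polluting each other's history, which is the role the slides attribute to PC indexing.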


Predicting Stall Time Indirectly

Can use BIT to predict BST indirectly
• compute time measurable upon arrival to barrier
• subtract from predicted BIT to derive predicted BST

How to manage time info?

[Figure: BIT divided into Compute_t followed by BST_t]


Managing Time Info

Threads depart from barrier instance b-1 toward instance b

Each thread t has local record of release timestamp BRTS_{t,b-1}

Assumptions:
• no global clock
• local wallclock active even if CPU sleeps
  – all CPUs same nominal clock frequency

[Figure: timeline from barrier b-1 to barrier b, marking BRTS_{t,b-1}]


Managing Time Info

Thread t arrives, knowing BRTS_{t,b-1}, Compute_{t,b}
• make prediction pBIT_b
• derive pBST_{t,b} = pBIT_b - Compute_{t,b}
• use pBST_{t,b} to pick sleep state (if warranted)
  – best fit based on transition time

[Figure: timeline from b-1 to b; pBIT_b starts at BRTS_{t,b-1} and splits into Compute_{t,b} and pBST_{t,b}]
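The arrival step can be sketched as follows. The sleep-state table and its transition times are hypothetical, and "best fit" is interpreted here as the deepest state whose transition time (sleep + wake-up) fits within the predicted stall; the paper may define the selection differently.

```python
from typing import NamedTuple, Optional, Sequence


class SleepState(NamedTuple):
    name: str
    transition_cycles: int  # sleep + wake-up time; hypothetical values


# Hypothetical states, ordered from shallowest to deepest.
STATES = (
    SleepState("S1", 10_000),
    SleepState("S2", 30_000),
    SleepState("S3", 50_000),
)


def predict_stall(pbit: int, compute: int) -> int:
    """pBST_{t,b} = pBIT_b - Compute_{t,b}."""
    return pbit - compute


def pick_sleep_state(
    pbst: int, states: Sequence[SleepState] = STATES
) -> Optional[SleepState]:
    """Deepest state whose transition time fits in the predicted stall, else None."""
    fitting = [s for s in states if s.transition_cycles <= pbst]
    return max(fitting, key=lambda s: s.transition_cycles) if fitting else None


# Predicted stall of 40,000 cycles: S1 and S2 fit, S2 is the deeper one.
long_stall = pick_sleep_state(predict_stall(pbit=100_000, compute=60_000))
# Predicted stall of 5,000 cycles: nothing fits, so the thread just spins.
short_stall = pick_sleep_state(predict_stall(pbit=100_000, compute=95_000))
```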


Managing Time Info

Last thread u arrives, knowing BRTS_{u,b-1}
• derive actual BIT_b = time() - BRTS_{u,b-1}
• update (shared) predictor with BIT_b
• release barrier

[Figure: timeline from b-1 to b; BIT_b measured from BRTS_{u,b-1}]


Managing Time Info

Every thread t (possibly after waking up late)
• read BIT_b from updated predictor
• compute actual BRTS_{t,b} = BRTS_{t,b-1} + BIT_b

Threads never use timestamps (BRTS) from other threads
• no global clock is needed

[Figure: timeline from b-1 to b; BRTS_{t,b-1} + BIT_b gives BRTS_{t,b}]
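The bookkeeping across these steps can be checked with a small simulation. It assumes, as the slides do, that local clocks tick at the same rate; the offsets, thread ids, and times are made up, and a single dictionary stands in for state that each thread would really keep privately.

```python
def release_barrier(brts, last_thread, last_thread_now):
    """Return (BIT_b, updated BRTS table) when the last thread arrives.

    brts maps thread id -> BRTS_{t,b-1}, each expressed in that thread's
    own local clock. Only the last thread's clock is read here.
    """
    bit = last_thread_now - brts[last_thread]  # BIT_b = time() - BRTS_{u,b-1}
    # BRTS_{t,b} = BRTS_{t,b-1} + BIT_b, computed from local data plus BIT_b only.
    return bit, {t: ts + bit for t, ts in brts.items()}


# Hypothetical local-clock offsets: clocks disagree but run at the same rate.
offsets = {0: 0, 1: 500, 2: -250}
true_release_prev = 1_000  # release of barrier b-1 (reference time, unknown to threads)
true_release_next = 1_800  # thread 2 arrives last at barrier b

brts_prev = {t: true_release_prev + off for t, off in offsets.items()}
bit, brts_next = release_barrier(
    brts_prev, last_thread=2, last_thread_now=true_release_next + offsets[2]
)
```

Because BIT_b is a duration rather than a timestamp, every thread ends up with a release timestamp that is correct in its own clock, even a thread that woke up late.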


Wake-up Mechanism

Communicate barrier completion to sleeping CPUs
• signal sent to CPU pin
• options: external vs. internal wake-up

External (passive): initiated by processor that releases barrier
• leverage coherence protocol – invalidation to spinlock
• must supply spinlock address to cache controller

Internal (active): triggered by watchdog timer
• program with predicted BST before going to sleep


Early vs. Late Wake-up

Early wake-up (underprediction)
• energy waste – residual spin

Late wake-up (overprediction)
• possible impact on execution time

External wake-up guarantees late wake-up (but bounded)

Internal wake-up can lead to both (late not bounded)

Our approach: hybrid wake-up
• external provides upper bound
• internal strives for timely wake-up using prediction
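A simplified timing model of hybrid wake-up, to make the bounded-lateness argument concrete. It assumes a single wake-up latency and a watchdog timer programmed to fire one wake-up latency before the predicted release; both are modelling assumptions, and all names and numbers are illustrative.

```python
def hybrid_wakeup(actual_stall, predicted_stall, wake_latency):
    """Return (residual_spin, lateness) in cycles, measured from sleep entry.

    External trigger: invalidation arrives at the actual release time.
    Internal trigger: watchdog timer, assumed to fire wake_latency before
    the predicted release so the CPU is awake when the release is expected.
    """
    timer_fires = max(predicted_stall - wake_latency, 0)
    wake_start = min(timer_fires, actual_stall)  # whichever trigger comes first
    awake_at = wake_start + wake_latency
    residual_spin = max(actual_stall - awake_at, 0)  # early: spin until release
    lateness = max(awake_at - actual_stall, 0)       # late: at most wake_latency
    return residual_spin, lateness


underpredicted = hybrid_wakeup(actual_stall=1_000, predicted_stall=800, wake_latency=100)
overpredicted = hybrid_wakeup(actual_stall=1_000, predicted_stall=5_000, wake_latency=100)
exact = hybrid_wakeup(actual_stall=1_000, predicted_stall=1_000, wake_latency=100)
```

In this model an underprediction costs residual spin, while an overprediction, however large, costs no more than one wake-up latency because the external signal takes over.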


Other Considerations (see paper)

Sleep states that do not snoop for coherence requests
• flush dirty data before sleeping
• defer invalidations to clean data

Overprediction threshold
• case of frequent, swinging BITs of modest size
• turn off prediction if overpredict beyond threshold

Interaction with context switching and I/O
• underprediction threshold

Time sharing issues: multiprogramming, overthreading
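One possible reading of the overprediction threshold is a guard around the last-value predictor, sketched below. How overprediction is measured (here, predicted BIT minus actual BIT), the threshold value, and the fact that this sketch never re-enables prediction are all assumptions; the paper gives the actual policy.

```python
class GuardedPredictor:
    """Last-value BIT predictor that switches itself off after a large overprediction."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical limit, in cycles
        self.enabled = True
        self.last_bit = None

    def predict(self):
        """Predicted BIT, or None when disabled or untrained (caller then just spins)."""
        return self.last_bit if self.enabled else None

    def update(self, actual_bit):
        predicted = self.predict()
        if predicted is not None and predicted - actual_bit > self.threshold:
            self.enabled = False  # overpredicted beyond threshold: stop predicting
        self.last_bit = actual_bit


guard = GuardedPredictor(threshold=1_000)
guard.update(5_000)
after_training = guard.predict()
guard.update(4_500)  # overpredicted by 500: within threshold
still_on = guard.enabled
guard.update(2_000)  # overpredicted by 2,500: beyond threshold
after_swing = guard.predict()
```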


Experimental Setup

Simulated system: 64-node CC-NUMA
• 6-way dynamic superscalar
• L1 16KB 64B 2-way 2clk; L2 64KB 64B 8-way 12clk
• 16B/4clk memory bus, 60ns SDRAM
• hypercube, wormhole, 4clk pipelined routers
  – 16clk pin to pin

Energy modeling: Wattch (CPU + L1 + L2)
• sleep states along lines of Pentium family


Experimental Setup

All Splash-2 applications except:
• Raytrace – no barriers
• LU – better version w/o barriers widely available

Efficiency (64p): 40-82%, avg. 58%

Target Group ≥ 10%


Energy Savings


Performance Impact


Related Work Highlights

Quite a bit of work in uniprocessor domain

Elnozahy et al.
• server farms, clusters
  – thrifty barrier targets shared memory, parallel apps.

Moshovos et al., Saldanha and Lipasti
• energy-aware cache coherence
  – probably compatible with and complementary to thrifty barrier


Conclusions

Energy-aware MP mechanisms can and should be pursued

Case of energy-aware barrier synchronization
• simple indirect prediction of barrier stall time
• hybrid wake-up scheme to minimize impact on exec. time

Encouraging results; target applications
• 17% avg. energy savings
• 2% avg. performance impact
