Evaluating Synchronization on Shared Address Space Multiprocessors:
Methodology & Performance
Sanjeev Kumar
Dongming Jiang
Rohit Chandra
Jaswinder Pal Singh
Classic Study on Synchronization
• Software Algorithms for Locks and Barriers [Mellor-Crummey et al., TOCS’91]
– Multiprocessor machines: BBN Butterfly, Sequent Symmetry
– Microbenchmarks
• Little benefit from special hardware support
– Handle memory/network contention in software
Case for Hardware Support
• Fetch&Op [Laudon et al., ISCA’97]
– Origin 2000
– Microbenchmarks (Counter & Barrier)
• QOLB [Kagi et al., ISCA’97]
– Simulations
– Microbenchmarks & Applications (Locks)
• Better performance with Hardware Support
Our Study
• Re-examine synchronization
– 64-processor Origin 2000
• New architectures: CC-NUMA
• New primitives: LL-SC
– Applications (SPLASH-2) and microbenchmarks
• Applications: Little benefit from H/W support
– Locks: Small performance benefit sometimes
– Barriers: Load-imbalance dominates
Outline
• Background
• Performance evaluation: Microbenchmarks
– Synchronization primitives on Origin 2000
– Lock and Barrier algorithms and performance
• Performance evaluation: Applications
• Is further hardware support valuable?
• Conclusions
Synchronization Primitives on Origin 2000
• LL-SC
– 2 instructions, cached
– Flexible
• Fetch&Op
– Special locations, uncached
– Inflexible (e.g. Atomic Swap)
• Performance Tradeoffs: Atomic update
– LL-SC: contention causes retries
– Fetch&Op: contention at memory
• Performance Tradeoffs: Wait
– LL-SC: spinning in cache (cache coherence)
– Fetch&Op: spinning generates traffic (no cache coherence)
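The atomic-update tradeoff can be illustrated in portable C++. This is only a sketch under stated assumptions: standard C++ exposes neither LL-SC nor the Origin 2000 Fetch&Op locations, so a compare-exchange retry loop stands in for LL-SC and `fetch_add` stands in for Fetch&Op. All function names here are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Stand-in for an LL-SC fetch-and-increment: read, compute, try to commit,
// and retry if the commit fails. `retries` counts failed attempts, the cost
// that grows when many processors update the same location.
inline int fetch_inc_llsc_style(std::atomic<int>& loc, long& retries) {
    int old = loc.load(std::memory_order_relaxed);
    while (!loc.compare_exchange_weak(old, old + 1,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed)) {
        ++retries;  // `old` was refreshed by the failed compare-exchange
    }
    return old;
}

// Stand-in for Fetch&Op: one request to the location, no retry loop.
inline int fetch_inc_fetchop_style(std::atomic<int>& loc) {
    return loc.fetch_add(1, std::memory_order_acq_rel);
}

// Hypothetical driver: every thread increments both counters `iters` times.
// Returns true if no increment was lost.
inline bool run_increment_demo(int nthreads, int iters) {
    std::atomic<int> a{0}, b{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&a, &b, iters] {
            long retries = 0;
            for (int i = 0; i < iters; ++i) {
                fetch_inc_llsc_style(a, retries);
                fetch_inc_fetchop_style(b);
            }
        });
    }
    for (auto& th : pool) th.join();
    return a.load() == nthreads * iters && b.load() == nthreads * iters;
}
```

Both versions produce the same count; the difference the slide points at is where contention shows up (retries at the processor versus queuing at memory), which this sketch does not measure.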
Lock Algorithms (1)
• Simple
– One location
– Atomic test-and-set: LL-SC, Fetch&Op
[Figure: Simple lock. Four processors (P) test one shared location ("Available?" / "No").]
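A minimal sketch of the simple lock, assuming C++ `std::atomic` as the test-and-set primitive (the study itself used LL-SC or Fetch&Op on the Origin 2000). The class and driver names are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Simple lock: every waiter repeatedly test-and-sets one shared location.
class SimpleLock {
public:
    void lock() {
        // exchange(true) returns the previous value; true means "not available".
        while (held_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() { held_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> held_{false};
};

// Hypothetical driver: returns the final value of a counter guarded by the lock.
inline long run_simple_lock_demo(int nthreads, int iters) {
    SimpleLock lock;
    long counter = 0;  // plain variable, protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Every waiter hits the same location, which is why this algorithm scales poorly in the microbenchmark.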
Lock Algorithms (2)
• Ticket
– Like in a bakery
– Proportional backoff
– Atomic fetch-and-increment: LL-SC, Fetch&Op
[Figure: Ticket lock. Next-Ticket = 132, Now-Serving = 125; processors (P) hold tickets 126, 127, 132, 125.]
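The ticket lock with proportional backoff can be sketched as follows. This is an illustration, not the code used in the study: `kBackoffUnit` and the use of `yield()` as the delay are assumptions, and the driver is hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TicketLock {
public:
    void lock() {
        // Take a ticket with an atomic fetch-and-increment.
        const unsigned my = next_ticket_.fetch_add(1, std::memory_order_relaxed);
        for (;;) {
            const unsigned serving = now_serving_.load(std::memory_order_acquire);
            if (serving == my) return;
            // Proportional backoff: wait longer the further back we are in line.
            const unsigned ahead = my - serving;
            for (unsigned i = 0; i < ahead * kBackoffUnit; ++i) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() {
        // Only the holder writes now_serving_, so a plain increment suffices.
        now_serving_.fetch_add(1, std::memory_order_release);
    }

private:
    static constexpr unsigned kBackoffUnit = 1;  // assumed tuning constant
    std::atomic<unsigned> next_ticket_{0};
    std::atomic<unsigned> now_serving_{0};
};

// Hypothetical driver: returns the final value of a counter guarded by the lock.
inline long run_ticket_lock_demo(int nthreads, int iters) {
    TicketLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Waiters only read `now_serving_`, and the backoff keeps distant waiters from polling it often.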
Lock Algorithms (3)
• MCS
– Queuing
– Local spinning
– Atomic Compare-and-Swap: LL-SC, not Fetch&Op
[Figure: MCS queuing lock. Processors (P) are linked in a queue.]
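A sketch of an MCS-style queue lock, assuming C++ atomics in place of LL-SC. It follows the usual textbook structure (swap to enqueue, compare-and-swap to release an empty queue); the type names and the driver are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One queue node per waiting thread; each thread spins only on its own node.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Atomic swap: append this node at the tail of the queue.
        McsNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Local spinning: wait on this thread's own flag.
            while (me.locked.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // Compare-and-swap: if this node is still the tail, nobody is waiting.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
                return;
            }
            // A successor has swapped itself in but not linked yet; wait for it.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false, std::memory_order_release);  // hand over the lock
    }

private:
    std::atomic<McsNode*> tail_{nullptr};
};

// Hypothetical driver: returns the final value of a counter guarded by the lock.
inline long run_mcs_lock_demo(int nthreads, int iters) {
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            McsNode node;  // this thread's private queue node
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The compare-and-swap in `unlock` is the step that a fetch-and-op primitive cannot express, which matches the "not Fetch&Op" note on the slide.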
Lock-Delay Microbenchmark
[Figure: Time (us) vs. Processors (1 to 64) for Simple (LL-SC), TicketProp (LL-SC), MCS (LL-SC), and TicketProp (Fetch&Op).]
Barrier Algorithms (1)
• Central
– Increment a counter
– Wait on a location
– Atomic fetch-and-increment: LL-SC, Fetch&Op
[Figure: Central barrier. Processors (P) increment a shared Arrived counter (5 shown) and wait on a Go location (No shown).]
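A minimal sketch of a central barrier, assuming a sense-reversing "go" flag so the barrier can be reused across episodes (the slide does not say how reuse was handled). Names are hypothetical, and C++ atomics stand in for LL-SC or Fetch&Op.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class CentralBarrier {
public:
    explicit CentralBarrier(int nthreads) : n_(nthreads) {}

    // `local_sense` is private per-thread state and must start as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        // Atomic fetch-and-increment on the shared arrival counter.
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) == n_ - 1) {
            // Last arrival: reset the counter, then release everyone.
            arrived_.store(0, std::memory_order_relaxed);
            go_.store(local_sense, std::memory_order_release);
        } else {
            // Everyone else waits on the single go location.
            while (go_.load(std::memory_order_acquire) != local_sense) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int n_;
    std::atomic<int> arrived_{0};
    std::atomic<bool> go_{false};
};

// Hypothetical driver: counts how often a thread left the barrier of phase p
// before all threads had registered their arrival in phase p (should be 0).
inline int run_central_barrier_demo(int nthreads, int phases) {
    CentralBarrier barrier(nthreads);
    std::vector<std::atomic<int>> hits(phases);
    for (auto& h : hits) h.store(0);
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&barrier, &hits, &violations, nthreads, phases] {
            bool sense = false;
            for (int p = 0; p < phases; ++p) {
                hits[p].fetch_add(1);
                barrier.wait(sense);
                if (hits[p].load() != nthreads) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Both the counter and the go flag are hotspots here: all processors update one location and all spin on another.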
Barrier Algorithms (2)
• Tournament
– Tree of locations
– Spin on different locations
– Avoid hotspot and contention
– Atomic fetch-and-increment: LL-SC, Fetch&Op
[Figure: Tournament barrier. Processors (P) at the leaves of a tree of flag locations.]
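The tournament idea can be sketched as below. This is a simplified variant and an assumption throughout: arrival uses a tournament tree in which each winner spins on its own per-round flag, but wakeup uses a single global release location rather than a wakeup tree, and flags are written with plain atomic stores of an episode number. It is not the implementation measured in the study.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

class TournamentBarrier {
public:
    explicit TournamentBarrier(int nthreads)
        : n_(nthreads), rounds_(count_rounds(nthreads)),
          flags_(static_cast<std::size_t>(rounds_) * nthreads),
          episode_(nthreads, 0) {
        for (auto& f : flags_) f.store(0, std::memory_order_relaxed);
    }

    // `id` is this thread's index in [0, nthreads).
    void wait(int id) {
        const int ep = ++episode_[id];  // private per-thread episode number
        for (int r = 0; r < rounds_; ++r) {
            const int bit = 1 << r;
            if (id & bit) {
                // Loser of round r: signal the winner, then wait for release.
                flag(r, id - bit).store(ep, std::memory_order_release);
                while (release_.load(std::memory_order_acquire) < ep) {
                    std::this_thread::yield();
                }
                return;
            }
            // Winner of round r: spin on its own flag for this round
            // (skipped when the opponent index does not exist).
            if (id + bit < n_) {
                while (flag(r, id).load(std::memory_order_acquire) < ep) {
                    std::this_thread::yield();
                }
            }
        }
        // Only thread 0 wins every round; it releases all waiting threads.
        release_.store(ep, std::memory_order_release);
    }

private:
    static int count_rounds(int n) {
        int r = 0;
        while ((1 << r) < n) ++r;
        return r;
    }
    std::atomic<int>& flag(int round, int id) { return flags_[round * n_ + id]; }

    const int n_;
    const int rounds_;
    std::vector<std::atomic<int>> flags_;  // one flag per (round, thread)
    std::vector<int> episode_;             // element i is touched only by thread i
    std::atomic<int> release_{0};
};

// Hypothetical driver: counts threads that left phase p's barrier too early.
inline int run_tournament_barrier_demo(int nthreads, int phases) {
    TournamentBarrier barrier(nthreads);
    std::vector<std::atomic<int>> hits(phases);
    for (auto& h : hits) h.store(0);
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&barrier, &hits, &violations, nthreads, phases, t] {
            for (int p = 0; p < phases; ++p) {
                hits[p].fetch_add(1);
                barrier.wait(t);
                if (hits[p].load() != nthreads) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

During arrival, each flag has exactly one writer and one spinner, which is how the tree avoids the hotspot of the central counter.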
Barrier-Null Microbenchmark
[Figure: Time (us) vs. Processors (1 to 64) for Central (LL-SC), Tournament (LL-SC), Central (Fetch&Op), and Hybrid (LL-SC, Fetch&Op).]
Microbenchmarks Summary
• LL-SC
– Simplest algorithms perform poorly, e.g. Simple lock and Central barrier
– Smarter algorithms perform much better
• Fetch&Op supports faster synchronization
Outline
• Background
• Performance evaluation: Microbenchmarks
• Performance evaluation: Applications
• Is further hardware support valuable?
• Conclusions
Choosing Applications: Methodology
• Applications from SPLASH-2
– Undo optimizations (added locks and barriers)
• Problem size
– At least 25-fold speedup on 64 processors
• Base case
– Best LL-SC lock and barrier
Base Performance
[Figure: Execution-time breakdown (Compute+Communication, Lock, Barrier), normalized 0 to 1, for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq.]
[Figure: Speedups (0 to 64) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq.]
Application performance using Different Locks
• Better algorithm helps
• Fetch&Op traffic hurts
[Figure: Normalized speedups (Base: MCS, LL-SC) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq, and uBench with Simple (LL-SC), TicketProp (LL-SC), and TicketProp (Fetch&Op) locks. One off-scale bar is labeled 1.65.]
Application performance using Different Barriers
• Load-imbalance dominates
• Fetch&Op traffic hurts
[Figure: Normalized speedups (Base: Tournament, LL-SC) for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-NSq, and uBench with Central (LL-SC), Central (Fetch&Op), and Central (Hybrid) barriers. Two off-scale bars are labeled 1.52 and 2.68.]
Applications Summary
• LL-SC
– Locks: Better algorithm helps
– Barriers: Load imbalance dominates
• Fetch&Op
– Traffic due to spinning hurts performance
• Different from the microbenchmarks
Outline
• Background
• Performance evaluation: Microbenchmarks
• Performance evaluation: Applications
• Is further hardware support valuable?
– Locks
– Barriers
• Conclusions
Sensitivity to Lock Performance
Adding round-trip network delays
[Figure: Normalized speedups for Base and 1 to 4 added round-trips, for Raytrace, Barnes, Radiosity, Ocean, Water-Spatial, Water-Nsq. Raytrace and Radiosity are called out.]
Extrapolate: 20-30% improvement from better hardware
When do faster locks help Applications?
• Applications sensitive to Lock performance
– Raytrace, Radiosity (~20-30%)
• Substantial time in synchronization
• Small contended critical sections
– Critical section size = actual + lock overhead
– Lock overhead dilates the critical section
– Effect on performance depends on size of critical section
– 2 Apps: ~5 us (1-2 updates to shared locations)
Can we fix contention problems in these cases in the Application?
• Yes. Fix was fairly easy
– Raytrace: global counter replaced by partial reductions
– Radiosity: single buffer allocation queue replaced by multiple queues; tasks formerly added to the local queue are distributed
• Significant performance improvement
– Raytrace: 90%, Radiosity: 220%
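The "global counter to partial reductions" fix can be illustrated with a generic sketch. It is not the Raytrace code: the counter, the padding size, and the function name are assumptions chosen only to show the pattern of accumulating privately and combining once.

```cpp
#include <thread>
#include <vector>

// Each thread gets its own slot, padded so that slots do not share a cache line
// (64 bytes is an assumed line size).
struct PaddedCount {
    long value = 0;
    char pad[64 - sizeof(long)];
};

// Before the fix, every update would lock and modify one global counter.
// After the fix, each thread updates only its own partial count, and the
// partial counts are summed once at the end.
inline long count_with_partial_reductions(int nthreads, int iters) {
    std::vector<PaddedCount> partial(nthreads);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&partial, t, iters] {
            for (int i = 0; i < iters; ++i) {
                ++partial[t].value;  // no lock, no shared location
            }
        });
    }
    for (auto& th : pool) th.join();

    long total = 0;  // the reduction step, done after all threads finish
    for (const auto& p : partial) total += p.value;
    return total;
}
```

This removes the small contended critical section entirely, which is why a fix in the application can matter more than a faster lock.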
Barriers
• Load-imbalance dominates
• Other applications
– Well-balanced with little communication: like the microbenchmarks; real applications?
– Well-balanced computation & communication: SOR (nearest neighbor on a grid)
– Barriers: 61% of execution time
– Imbalance still dominates: communication imbalance
Summary & Conclusions
• Fetch&Op does not help Applications
– At least for well-known lock & barrier algorithms
• Using applications is important
• Little benefit from hardware support
– Locks: helps sometimes; fixable
– Barriers: load imbalance dominates
• Sound Methodology
Tournament barrier with Fetch&Op
• Worse performance
– Preliminary measurements indicated worse overhead in addition to traffic
• Barrier performance did not make a difference in the applications
Small problem size
• Raytrace : Decreases lock time
• Barnes : Load-imbalance increases
• Water-Nsq : Load-imbalance and Serialization
• Ocean & SOR : Barrier time remains same
• Radiosity & Water-Spatial : Not available
SOR Breakdown
[Figure: Execution-time breakdown (Compute + Communication, Barrier), normalized 0 to 1, per processor (1 to 64).]
Load imbalance dominates the time spent in barriers