John M. Mellor-Crummey Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors Joseph Garvey & Joshua San Miguel Michael L. Scott


John M. Mellor-Crummey

Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

Joseph Garvey & Joshua San Miguel

Michael L. Scott


Dance Hall Machines?


Atomic Instructions

•Various atomic instructions, known collectively as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap

•Some can be used to simulate others, but often with overhead

•Some lock types require a particular primitive, either to be implemented at all or to be implemented efficiently


Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked ;

procedure release_lock (lock *L)
    *L = unlocked
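The pseudocode above can be exercised with a small runnable sketch. It assumes an emulated `test_and_set`: the `_guard` lock stands in for hardware atomicity on one memory word and is not part of the algorithm. `TasLock` and `count_with_tas_lock` are hypothetical names used only for illustration.

```python
import threading
import time


class TasLock:
    """Basic test_and_set spin lock; True means locked."""

    def __init__(self):
        self._locked = False
        # Assumption: this guard only emulates atomic access to one word.
        self._guard = threading.Lock()

    def _test_and_set(self):
        with self._guard:
            old = self._locked
            self._locked = True
            return old

    def acquire(self):
        # Every retry is an atomic read-modify-write on the shared word.
        while self._test_and_set():
            time.sleep(0)  # let other Python threads run; hardware just spins

    def release(self):
        # The store goes through the guard so it cannot interleave with
        # the two halves of an emulated test_and_set.
        with self._guard:
            self._locked = False


def count_with_tas_lock(num_threads=4, iterations=200):
    """Increment a shared counter under the lock and return the final value."""
    lock = TasLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

If mutual exclusion holds, `count_with_tas_lock(4, 200)` returns 800 because no increment is lost.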


Test_and_set: Basic

[Figure: three processors (P), each with its own cache ($), connected to a shared Memory]


Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked
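What test_and_test_and_set buys over the basic lock can be shown by counting atomic operations. The sketch below is a single-threaded model, not a usable lock: `CountingFlag` is a hypothetical stand-in for the lock word that records how many times `test_and_set` is issued while a waiter spins on a lock that someone else holds.

```python
class CountingFlag:
    """Lock word that counts test_and_set calls (single-threaded model only)."""

    def __init__(self, locked=False):
        self.locked = locked
        self.atomic_ops = 0

    def test_and_set(self):
        self.atomic_ops += 1
        old = self.locked
        self.locked = True
        return old


def spin_basic(flag, max_spins):
    """Basic lock: every spin iteration issues an atomic test_and_set."""
    for _ in range(max_spins):
        if not flag.test_and_set():
            return True
    return False


def spin_ttas(flag, max_spins):
    """test_and_test_and_set: read first, go atomic only if the lock looks free."""
    for _ in range(max_spins):
        if not flag.locked:
            if not flag.test_and_set():
                return True
    return False
```

Spinning 100 times on a held lock costs 100 atomic operations in the basic variant and none in the test_and_test_and_set variant, which only reads the word until it appears free.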


Test_and_set: test_and_test_and_set

[Figure: three processors (P), each with its own cache ($), connected to a shared Memory]


Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked
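The backoff loop above can be traced with a small sketch. Two things here are assumptions rather than slide content: the `cap` parameter, which bounds the delay (the slide's version doubles without limit), and the idea of passing `test_and_set` and `pause` in as callables so the schedule can be observed without real hardware or real sleeping.

```python
def acquire_with_backoff(test_and_set, pause, base=1, cap=64):
    """Spin with exponential backoff and return the number of failed attempts.

    test_and_set: callable returning True while the lock is held by someone else.
    pause:        callable taking the delay to wait before the next attempt.
    cap:          upper bound on the delay (an assumption, not in the slide).
    """
    delay = base
    failures = 0
    while test_and_set():
        pause(delay)
        delay = min(delay * 2, cap)  # double the delay, but never past the cap
        failures += 1
    return failures
```

With a lock that stays held for three attempts, the recorded pauses are 1, 2 and 4 time units. The units are abstract; a real implementation would tune `base` and `cap` to the machine.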


Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment(L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1
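A runnable sketch of the ticket lock follows, assuming an emulated `fetch_and_increment` (the `_guard` lock stands in for hardware atomicity). `TicketLock` and `tickets_in_service_order` are hypothetical names. The helper records the ticket of each thread as it enters the critical section, which makes the first-come, first-served order visible.

```python
import threading
import time


class TicketLock:
    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Assumption: emulates the atomicity of fetch_and_increment only.
        self._guard = threading.Lock()

    def _fetch_and_increment(self):
        with self._guard:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def acquire(self):
        my_ticket = self._fetch_and_increment()
        # Waiters only read now_serving; no atomic operation while spinning.
        while self._now_serving != my_ticket:
            time.sleep(0)  # let other Python threads run
        return my_ticket

    def release(self):
        # Only the current holder writes now_serving, so a plain store suffices.
        self._now_serving += 1


def tickets_in_service_order(num_threads=4, iterations=50):
    """Return the tickets in the order their owners entered the critical section."""
    lock = TicketLock()
    order = []

    def worker():
        for _ in range(iterations):
            ticket = lock.acquire()
            order.append(ticket)
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

The returned list is strictly increasing from 0, since the lock is granted in ticket order.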


Array-Based Queuing Locks

type lock = record
    slots = array [0…numprocs – 1] of (has_lock, must_wait)
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // Various modulo work to handle overflow
    while L->slots[my_place] == must_wait ;
    L->slots[my_place] = must_wait

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock
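The array-based lock can be sketched in runnable form under a few stated assumptions: `fetch_and_increment` is emulated with a guard lock, `True` stands for has_lock and `False` for must_wait, `acquire` returns `my_place` so the caller can hand it back to `release`, and the modulo is applied directly because Python integers do not overflow. `ArrayQueueLock` and `count_with_array_lock` are hypothetical names.

```python
import threading
import time


class ArrayQueueLock:
    """Array-based queuing lock for at most `numprocs` concurrent contenders."""

    def __init__(self, numprocs):
        self._size = numprocs
        self._slots = [False] * numprocs  # False = must_wait, True = has_lock
        self._slots[0] = True             # the first arrival gets the lock
        self._next_slot = 0
        # Assumption: emulates the atomicity of fetch_and_increment only.
        self._guard = threading.Lock()

    def _fetch_and_increment(self):
        with self._guard:
            slot = self._next_slot
            self._next_slot += 1
            return slot

    def acquire(self):
        my_place = self._fetch_and_increment() % self._size
        # Each waiter spins on its own slot rather than on one shared word.
        while not self._slots[my_place]:
            time.sleep(0)
        self._slots[my_place] = False     # reset the slot for its next use
        return my_place

    def release(self, my_place):
        self._slots[(my_place + 1) % self._size] = True


def count_with_array_lock(num_threads=4, iterations=100):
    """Increment a shared counter under the lock and return the final value."""
    lock = ArrayQueueLock(num_threads)
    total = [0]

    def worker():
        for _ in range(iterations):
            place = lock.acquire()
            total[0] += 1
            lock.release(place)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The sketch is only safe while the number of simultaneous contenders does not exceed the number of slots, which is why the helper sizes the array to the thread count.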


Array-Based Queuing Locks

[Figure: Memory holds next_slot and the slots array; each of three processors (P), with its own cache ($), keeps its own my_place]


MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked ;

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false
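The MCS pseudocode above translates fairly directly into a runnable sketch. As an assumption for illustration, `fetch_and_store` and `compare_and_swap` on the tail pointer are emulated with a guard lock, since Python exposes no such instructions. `QNode`, `MCSLock` and `count_with_mcs_lock` are hypothetical names; each thread supplies its own queue node, as in the slide.

```python
import threading
import time


class QNode:
    def __init__(self):
        self.next = None
        self.locked = False


class MCSLock:
    def __init__(self):
        self._tail = None  # the lock itself: a pointer to the last queue node
        # Assumption: emulates atomic fetch_and_store / compare_and_swap.
        self._guard = threading.Lock()

    def _fetch_and_store(self, new):
        with self._guard:
            old = self._tail
            self._tail = new
            return old

    def _compare_and_swap(self, expected, new):
        with self._guard:
            if self._tail is expected:
                self._tail = new
                return True
            return False

    def acquire(self, node):
        node.next = None
        predecessor = self._fetch_and_store(node)
        if predecessor is not None:       # queue was not empty
            node.locked = True            # set before linking in
            predecessor.next = node
            while node.locked:            # spin on our own node only
                time.sleep(0)

    def release(self, node):
        if node.next is None:
            # No known successor: try to swing the tail back to empty.
            if self._compare_and_swap(node, None):
                return
            # A successor is in the middle of linking itself in; wait for it.
            while node.next is None:
                time.sleep(0)
        node.next.locked = False          # hand the lock to the successor


def count_with_mcs_lock(num_threads=4, iterations=100):
    """Increment a shared counter under the lock and return the final value."""
    lock = MCSLock()
    total = [0]

    def worker():
        node = QNode()                    # one queue node per thread
        for _ in range(iterations):
            lock.acquire(node)
            total[0] += 1
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Setting `node.locked` before writing `predecessor.next` matters: once the predecessor can see the node, it may clear the flag at any moment.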


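MCS Locks

[Figure: queue of qnodes reached from the lock pointer L, with node labels 1-R, 2-B, 3-B, 2-R, 3-R, 3-E, 4-B, 5-B, 4-R across animation steps]

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null ;
    I->next->locked = false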


Results: Scalability – Distributed Memory Architecture


Results: Scalability – Cache Coherent Architecture


Results: Single Processor Lock/Release Time

Times are in μs

                           Test_and_set  Ticket  Anderson (Queue)  MCS
Butterfly (Distributed)    34.9          38.7    65.7              71.3
Symmetry (Cache coherent)  7.0           NA      10.6              9.2

•Butterfly’s atomic instructions are very expensive

•Butterfly can’t handle 24-bit pointers


Results: Network Congestion

                                Increase in Network Latency Measured From
Busy-wait Lock                  Lock Node    Idle Node
test_and_set                    1420%        96%
test_and_set w/ linear backoff  882%         67%
test_and_set w/ exp. backoff    32%          4%
ticket                          992%         97%
ticket w/ prop backoff          53%          8%
Anderson                        75%          67%
MCS                             4%           2%


Which lock should I use?

fetch_and_store supported?
    Yes: MCS
    No: fetch_and_increment supported?
        Yes: Ticket
        No: test_and_set w/ exp backoff

•If atomic instructions are much more expensive than normal instructions and single-processor latency is very important, don’t use MCS

•If processes might be preempted, use test_and_set with exponential backoff
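The decision procedure on this slide can be written as a small function. This is one reading of the slide, with the two bullet-point exceptions checked first. The function name, parameter names and return strings are hypothetical, and the fallback used when MCS is ruled out (drop to the next branch of the tree) is an assumption, since the slide only says not to use MCS.

```python
def choose_lock(has_fetch_and_store, has_fetch_and_increment,
                may_be_preempted=False,
                atomics_costly_and_latency_critical=False):
    """Pick a busy-wait lock following the slide's decision tree.

    All names are illustrative. The fallback after excluding MCS is an
    assumption; the slide does not name the alternative.
    """
    if may_be_preempted:
        return "test_and_set with exponential backoff"
    if has_fetch_and_store and not atomics_costly_and_latency_critical:
        return "MCS"
    if has_fetch_and_increment:
        return "ticket"
    return "test_and_set with exponential backoff"
```

For example, a machine with fetch_and_increment but no fetch_and_store maps to the ticket lock.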


Centralized Barrier

[Figure: processors P0, P1, P2, P3 and a single shared counter shown with values 0 through 4]


Software Combining Tree Barrier

[Figure: combining tree over processors P0, P1, P2, P3, grouped as (P0, P1) and (P2, P3), with a counter at each tree node]


Tournament Barrier

[Figure: processors P0, P1, P2, P3 paired off in rounds, with nodes labeled W (winner), L (loser) and C (champion)]


Dissemination Barrier

[Figure: processors P0, P1, P2, P3 signaling one another across rounds]


New Tree-Based Barrier

[Figure: processors P0, P1, P2, P3 arranged in a tree, with per-node counter values]


Summary

Barrier                  Space       Wakeup     Local Spinning  Network Txns
Centralized              O(1)        broadcast  no              O(p) or O(∞)
Software Combining Tree  O(p)        tree       no              O(p × fan-in) or O(∞)
Tournament               O(p log p)  tree       yes             O(p)
Dissemination            O(p log p)  none       yes             O(p log p)
New Tree-Based           O(p)        tree       yes             2p - 2
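The last column of the table can be made concrete for the two barriers whose counts have a simple closed form. The sketch below assumes the standard dissemination pattern, in which processor i signals processor (i + 2^k) mod p in round k, for ⌈log2 p⌉ rounds, and takes the 2p - 2 figure for the new tree-based barrier directly from the table. The function names are hypothetical.

```python
def dissemination_rounds(p):
    """Number of rounds, ceil(log2(p)), computed with integer arithmetic."""
    return (p - 1).bit_length()


def dissemination_partners(p):
    """For each round k, the processor that each processor i signals."""
    return [[(i + 2 ** k) % p for i in range(p)]
            for k in range(dissemination_rounds(p))]


def dissemination_signals(p):
    """Every processor sends one signal per round: p * ceil(log2(p)) in total."""
    return p * dissemination_rounds(p)


def tree_barrier_transactions(p):
    """Network transactions for the new tree-based barrier, per the table."""
    return 2 * p - 2
```

For p = 64 this gives 384 signals for the dissemination barrier against 126 transactions for the tree-based barrier, although the dissemination barrier's critical path is only ⌈log2 p⌉ rounds long.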


Results – Distributed Shared Memory


Results – Broadcast-Based Cache-Coherent


Results – Local vs. Remote Spinning

Barrier         Network Latency (local)  Network Latency (remote)
New Tree-Based  10% increase             124% increase
Dissemination   18% increase             117% increase


Barrier Decision Tree

Multiprocessor?
    Distributed Shared Memory: Dissemination Barrier, or New Tree-Based Barrier (tree wakeup)
    Broadcast-Based Cache-Coherent: Centralized Barrier, or New Tree-Based Barrier (central wakeup)


Architectural Recommendations

•No dance hall machines

•No need for complicated hardware synchronization

•Need a full set of fetch_and_Φ instructions