
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors


John M. Mellor-Crummey and Michael L. Scott
Presented by Joseph Garvey & Joshua San Miguel

Dance Hall Machines?
(Machines in which all processors sit on one side of the interconnection network and all memory on the other, so every shared location is remote to every processor.)

Atomic Instructions
A family of instructions known collectively as fetch_and_Φ instructions: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap.
Some can be used to simulate others, but often with overhead.
Some lock types require a particular primitive in order to be implemented at all, or to be implemented efficiently.

Test_and_set: Basic

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while test_and_set (L) == locked
        ;   // spin

procedure release_lock (lock *L)
    *L = unlocked

[Diagram: processors with caches ($ P) all spinning on the lock word in shared Memory]
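Below is a minimal C11 sketch of this basic lock, assuming <stdatomic.h>; atomic_flag_test_and_set stands in for the slide's test_and_set primitive, and the type and function names are illustrative rather than from the paper.

#include <stdatomic.h>

typedef atomic_flag tas_lock;              /* clear = unlocked, set = locked */
#define TAS_LOCK_INIT ATOMIC_FLAG_INIT

static void tas_acquire(tas_lock *L) {
    /* every iteration is an atomic read-modify-write on the shared lock word */
    while (atomic_flag_test_and_set_explicit(L, memory_order_acquire))
        ;                                  /* spin */
}

static void tas_release(tas_lock *L) {
    atomic_flag_clear_explicit(L, memory_order_release);
}

Because every spin iteration is a read-modify-write on the shared lock word, all waiters keep generating traffic to it; the variants that follow aim to reduce exactly that traffic.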

Test_and_set: test_and_test_and_set

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    while 1
        if *L == unlocked
            if test_and_set (L) == unlocked
                return

procedure release_lock (lock *L)
    *L = unlocked

[Diagram: processors with caches ($ P) spinning on cached copies of the lock word in shared Memory]
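A C11 sketch of the same idea, assuming an atomic_bool lock word so the waiting loop can use ordinary loads; the names are illustrative. Waiters spin on a read (which can be satisfied from the local cache) and only attempt the atomic exchange once the lock looks free.

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool ttas_lock;             /* false = unlocked, true = locked */

static void ttas_acquire(ttas_lock *L) {
    for (;;) {
        /* read-only spin: stays in the local cache while the lock is held */
        while (atomic_load_explicit(L, memory_order_relaxed))
            ;
        /* lock looked free: now try the atomic read-modify-write */
        if (!atomic_exchange_explicit(L, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(ttas_lock *L) {
    atomic_store_explicit(L, false, memory_order_release);
}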

Test_and_set: test_and_set with backoff

type lock = (unlocked, locked)

procedure acquire_lock (lock *L)
    delay = 1
    while test_and_set (L) == locked
        pause (delay)
        delay = delay * 2

procedure release_lock (lock *L)
    *L = unlocked

[Diagram: processors with caches ($ P) and shared Memory; waiting processors back off between attempts]
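A C11 sketch of test_and_set with exponential backoff; the busy-wait pause loop and the cap on the delay are illustrative choices, not values from the paper.

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool tasb_lock;             /* false = unlocked, true = locked */

static void pause_for(unsigned iterations) {
    /* crude delay loop; real code would use a platform pause/yield hint */
    for (volatile unsigned i = 0; i < iterations; i++)
        ;
}

static void tasb_acquire(tasb_lock *L) {
    unsigned delay = 1;
    while (atomic_exchange_explicit(L, true, memory_order_acquire)) {
        pause_for(delay);
        if (delay < 1024)                  /* illustrative upper bound on the backoff */
            delay *= 2;
    }
}

static void tasb_release(tasb_lock *L) {
    atomic_store_explicit(L, false, memory_order_release);
}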

Ticket Lock

type lock = record
    next_ticket = 0
    now_serving = 0

procedure acquire_lock (lock *L)
    my_ticket = fetch_and_increment (L->next_ticket)
    while 1
        if L->now_serving == my_ticket
            return

procedure release_lock (lock *L)
    L->now_serving = L->now_serving + 1

[Diagram: next_ticket and now_serving in shared Memory; each processor ($ P) holds its own my_ticket]
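A C11 sketch of the ticket lock, with atomic_fetch_add standing in for fetch_and_increment; field and function names are illustrative. Only the holder writes now_serving, so the release is a simple increment of that field.

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;               /* next ticket to hand out */
    atomic_uint now_serving;               /* ticket currently allowed in */
} ticket_lock;                             /* zero-initialise both fields */

static void ticket_acquire(ticket_lock *L) {
    unsigned my_ticket =
        atomic_fetch_add_explicit(&L->next_ticket, 1, memory_order_relaxed);
    /* read-only spin: no further read-modify-writes while waiting */
    while (atomic_load_explicit(&L->now_serving, memory_order_acquire) != my_ticket)
        ;
}

static void ticket_release(ticket_lock *L) {
    unsigned next = atomic_load_explicit(&L->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&L->now_serving, next, memory_order_release);
}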

Array-Based Queuing Locks

type lock = record
    slots = array [0..numprocs-1] of (has_lock, must_wait)
    next_slot = 0

procedure acquire_lock (lock *L)
    my_place = fetch_and_increment (L->next_slot)
    // various modulo work to handle overflow
    while L->slots[my_place] == must_wait
        ;   // spin
    L->slots[my_place] = must_wait

procedure release_lock (lock *L)
    L->slots[my_place + 1] = has_lock

[Diagram: next_slot and the slots array in shared Memory; each processor ($ P) holds its own my_place]
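A C11 sketch of the array-based queuing lock, assuming a fixed NUMPROCS that is a power of two so unsigned wrap-around keeps the modulo arithmetic consistent; per-slot cache-line padding is omitted for brevity, and the names are illustrative.

#include <stdatomic.h>

#define NUMPROCS 64                        /* illustrative; must be a power of two here */

enum slot_state { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_int  slots[NUMPROCS];           /* one slot per waiting processor */
    atomic_uint next_slot;
} array_lock;

static void array_lock_init(array_lock *L) {
    for (int i = 0; i < NUMPROCS; i++)
        atomic_store(&L->slots[i], MUST_WAIT);
    atomic_store(&L->slots[0], HAS_LOCK);  /* the first ticket proceeds immediately */
    atomic_store(&L->next_slot, 0);
}

/* returns my_place, which the caller must pass back to array_release */
static unsigned array_acquire(array_lock *L) {
    unsigned my_place =
        atomic_fetch_add_explicit(&L->next_slot, 1, memory_order_relaxed) % NUMPROCS;
    /* each thread spins on its own slot, not on a single shared word */
    while (atomic_load_explicit(&L->slots[my_place], memory_order_acquire) == MUST_WAIT)
        ;
    /* reset the slot so it is ready the next time it is handed out */
    atomic_store_explicit(&L->slots[my_place], MUST_WAIT, memory_order_relaxed);
    return my_place;
}

static void array_release(array_lock *L, unsigned my_place) {
    atomic_store_explicit(&L->slots[(my_place + 1) % NUMPROCS],
                          HAS_LOCK, memory_order_release);
}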

MCS Locks

type qnode = record
    qnode *next
    bool locked

type lock = qnode*

procedure acquire_lock (lock *L, qnode *I)
    I->next = Null
    qnode *predecessor = fetch_and_store (L, I)
    if predecessor != Null
        I->locked = true
        predecessor->next = I
        while I->locked
            ;   // spin

procedure release_lock (lock *L, qnode *I)
    if I->next == Null
        if compare_and_swap (L, I, Null)
            return
        while I->next == Null
            ;   // spin
    I->next->locked = false

[Diagram: the lock's tail pointer and per-processor qnodes (next, locked) in shared Memory, linked into a queue]
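A C11 sketch of the MCS lock, with atomic_exchange standing in for fetch_and_store and atomic_compare_exchange_strong for compare_and_swap; the names are illustrative. Each thread passes in its own qnode (typically stack- or thread-local) and spins only on that node.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    struct qnode *_Atomic next;
    atomic_bool   locked;
} qnode;

typedef struct {
    qnode *_Atomic tail;                   /* NULL means the lock is free */
} mcs_lock;

static void mcs_acquire(mcs_lock *L, qnode *I) {
    atomic_store_explicit(&I->next, (qnode *)NULL, memory_order_relaxed);
    /* fetch_and_store: append ourselves and learn who was ahead of us */
    qnode *pred = atomic_exchange_explicit(&L->tail, I, memory_order_acq_rel);
    if (pred != NULL) {
        atomic_store_explicit(&I->locked, true, memory_order_relaxed);
        atomic_store_explicit(&pred->next, I, memory_order_release);
        while (atomic_load_explicit(&I->locked, memory_order_acquire))
            ;                              /* spin only on our own qnode */
    }
}

static void mcs_release(mcs_lock *L, qnode *I) {
    qnode *succ = atomic_load_explicit(&I->next, memory_order_acquire);
    if (succ == NULL) {
        /* no known successor: try to swing the tail back to NULL */
        qnode *expected = I;
        if (atomic_compare_exchange_strong_explicit(&L->tail, &expected, (qnode *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is in the middle of linking itself in; wait for it */
        while ((succ = atomic_load_explicit(&I->next, memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}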

Results: Scalability (Distributed Memory Architecture)

Results: Scalability (Cache-Coherent Architecture)

Results: Single Processor Lock/Release Time
The Butterfly's atomic instructions are very expensive, and the Butterfly can't handle 24-bit pointers.
Times are in microseconds.

                              Test_and_set   Ticket   Anderson (Queue)   MCS
Butterfly (distributed)           34.9        38.7         65.7         71.3
Symmetry (cache-coherent)          7.0         NA          10.6          9.2

Results: Network Congestion

Busy-wait Lock                    Increase in network latency measured from
                                  Lock Node        Idle Node
test_and_set                         1420%             96%
test_and_set w/ linear backoff        882%             67%
test_and_set w/ exp. backoff           32%              4%
ticket                                992%             97%
ticket w/ prop. backoff                53%              8%
Anderson                               75%             67%
MCS                                     4%              2%

Which lock should I use?
If atomic instructions are much more expensive than normal instructions and single-processor latency is very important: don't use MCS.
If processes might be preempted: use test_and_set with exponential backoff.
Otherwise:
    fetch_and_store supported?  Yes: MCS.
    No, but fetch_and_increment supported?  Yes: Ticket.
    Neither: test_and_set with exponential backoff.

Centralized Barrier
[Diagram: P0..P3 incrementing a shared counter]

Software Combining Tree Barrier
[Diagram: P0..P3 combining their arrival counts up a tree]

Tournament Barrier
[Diagram: P0..P3 paired off in tournament rounds]

Dissemination Barrier
[Diagram: P0..P3 exchanging signals in rounds]

New Tree-Based Barrier
[Diagram: P0..P3 signalling arrival up a tree]

Summary

Barrier                    Space        Wakeup      Local Spinning   Network Txns
Centralized                O(1)         broadcast   no               O(p) or unbounded
Software Combining Tree    O(p)         tree        no               O(p fan-in) or unbounded
Tournament                 O(p log p)   tree        yes              O(p)
Dissemination              O(p log p)   none        yes              O(p log p)
New Tree-Based             O(p)         tree        yes              2p - 2
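For reference against the Centralized row of the table above, here is a minimal C11 sketch of a sense-reversing centralized barrier; the structure, names, and memory orderings are illustrative rather than the paper's code. The last arriver resets the counter and flips the shared sense flag (the broadcast wakeup), while everyone else spins on that one flag.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint count;                     /* arrivals in the current episode */
    atomic_bool sense;                     /* global sense, flipped each episode */
    unsigned    nthreads;                  /* fixed number of participants */
} central_barrier;                         /* count = 0, sense = false initially */

/* local_sense is per-thread state, initially false to match the global sense */
static void central_barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;
    unsigned arrived =
        atomic_fetch_add_explicit(&b->count, 1, memory_order_acq_rel) + 1;
    if (arrived == b->nthreads) {
        /* last arriver: reset for the next episode, then release everyone */
        atomic_store_explicit(&b->count, 0, memory_order_relaxed);
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* all other threads spin on the single shared flag */
        while (atomic_load_explicit(&b->sense, memory_order_acquire) != *local_sense)
            ;
    }
}

On a broadcast-based cache-coherent machine every waiter spins on a cached copy of sense and a single flip wakes them all; without coherent caches this spinning produces the unbounded network traffic noted in the table.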

Results: Distributed Shared Memory

Results: Broadcast-Based Cache-Coherent

Results: Local vs. Remote Spinning

Barrier          Network Latency (local spinning)   Network Latency (remote spinning)
New Tree-Based            10% increase                      124% increase
Dissemination             18% increase                      117% increase

Barrier Decision Tree
What kind of multiprocessor?
    Distributed shared memory: Dissemination Barrier, or New Tree-Based Barrier (tree wakeup).
    Broadcast-based cache-coherent: Centralized Barrier, or New Tree-Based Barrier (central wakeup).

Architectural Recommendations
No dance hall machines.
No need for complicated hardware synchronization.
Need a full set of fetch_and_Φ primitives.
