34
SYNAR Systems Networking and Architecture Group SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

Page 1: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture GroupSYNAR

Systems Networking and Architecture Group

CMPT 886: The Art of Scalable Synchronization

Dr. Alexandra FedorovaSchool of Computing Science

SFU

Page 2: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Synchronization

if(account_balance >= amount){

account_balance -= amount;}

Thread 1: perform a withdrawal if(account_balance >= service_fee){

account_balance -= service_fee;}

Thread 2: subtract service fee

1 2

3 4

Unsynchronized Access

Account balanced has changed between steps 2 and 4!!!

Synchronized Access

lock_aquire(account_balance_lock);if(account_balance >= amount){

account_balance -= amount;}lock_release(account_balance);

lock_aquire(account_balance_lock);if(account_balance >= service_fee){

account_balance -= service_fee;}lock_release(account_balance);

Page 3: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Synchronization Primitives (SP)• Synchronization primitives provide atomic access to a critical section• Types of synchronization primitives

– mutex– semaphore– lock– condition variable– etc.

• Synchronization primitives are provided by the OS• Can also be implemented by a library (e.g., pthreads) or by the

application • Hardware provides special atomic instructions for implementation of

synchronization primitives (test-and-set, compare-and-swap, etc.)

Page 4: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Implementation of SP • Performance of applications that use SP is determined by an

implementation of the SP• A SP must be scalable – must continue to perform well as the

number of contending threads increases• We will look at several implementations of locks to

understand how to create a scalable implementation

Page 5: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

What should you do if you can’t get a lock?

• Keep trying– “spin” or “busy-wait”– Good if delays are short

• Give up the processor– Good if delays are long– Always good on uniprocessor

• Systems usually use a combination:– Spin for a while, then give up the processor

• We will focus on multiprocessors, so we’ll look at spinlock implementations

© Herlihy-Shavit 2007

Page 6: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

A Shared Memory Multiprocessor

Bus

cache

memory

cachecache

© Herlihy-Shavit 2007

Page 7: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Basic Spinlock

CS

Resets lock upon exit

spin lock

critical section

...

…lock suffers from contention

Sequential Bottleneck no parallelism

© Herlihy-Shavit 2007

Page 8: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Review: Test-and-Set

• We have a boolean value in memory• Test-and-set (TAS)

– Swap true with prior value– Return value tells if prior value was true or false

• Can reset just by writing false

© Herlihy-Shavit 2007

Page 9: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TAS• Provided by the hardware• Example SPARC: an assembly instruction load-store

unsigned byte ldstub

public class AtomicBoolean {

boolean value; public synchronized boolean getAndSet(boolean newValue)

{ boolean prior = value;

value = newValue;return prior;

}} Swap old and new values

loads a byte from memory to a return register writes the value 0xFF into the addressed byte atomically.

© Herlihy-Shavit 2007

• TAS can be implemented in a high-level language. • Example in Java:

Page 10: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TAS Locks

• Value of TAS’ed memory shows lock state:– Lock is free: value is false– Lock is taken: value is true

• Acquire lock by calling TAS:– If result is false, you win– If result is true, you lose

• Release lock by writing false

Page 11: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TAS Lock in SPARC Assemblyspin_lock:busy_loop: ldstub [%o0],%o1 tst %o1 bne busy_loop nop ! delay slot for branch retl nop ! delay slot for branch

loads old value into reg. o1. Writes “1” into memory at address in %o0 .Test if %o1 equals to zero.

If %o1 is not zero (old value is true), spin.

Page 12: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TAS Lock in Java

class TASlock {

AtomicBoolean state = new AtomicBoolean(false);

void lock() {

while (state.getAndSet(true)) {} }

void unlock() {

state.set(false); }}

Initialize lock state to false (unlocked)

While lock is taken (true) spin.

Release the lock – set state to false

© Herlihy-Shavit 2007

Page 13: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Performance of TAS Lock• Experiment

– N threads on a multiprocessor– Increment shared counter 1 million times (total)– The thread acquires a lock before incrementing the counter– Each thread does 1,000,000/N increments

• N does not exceed the number of processors no thread switching overhead

• How long should it take?• How long does it take?

Page 14: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Expected performance

idealTota

l tim

e

Number of threads

no speedup because there is no parallelism

© Herlihy-Shavit 2007

lock_acquireincrementlock_release

lock_acquireincrementlock_release

lock_acquireincrementlock_release

lock_acquireincrementlock_release

Thread 1 Thread 2

same as sequential execution

Page 15: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Actual Performance

TAS lock

Ideal Much worse than ideal

Tota

l tim

e

Number of threads

© Herlihy-Shavit 2007

Page 16: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Reasons for Bad TAS Lock Performance

• Has to do with cache behaviour on the multiprocessor system

• TAS causes a lot of invalidation misses– This hurts performance

• To understand what this means, let’s review how caches work

Page 17: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Processor Issues Load Request

Bus

cache

memory

cachecache

datadata

© Herlihy-Shavit 2007

Page 18: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Another Processor Issues Load Request

Bus

cache

memory

cachecache

data

dataBus

I got data

dataBus

I want data

© Herlihy-Shavit 2007

Page 19: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

memory

Bus

Processor Modifies Data

cache cachecache

data

datadata

Now other copies are invalid

data

© Herlihy-Shavit 2007

Page 20: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Send Invalidation Message to Others

memory

Bus

cache cachecache

data

datadata data

Invalidate!

Bus

Other caches lose read permission

No need to change now: other caches can provide valid data

© Herlihy-Shavit 2007

Page 21: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Processor Asks for Data

memory

Bus

cache cachecache

data

datadata

Bus

I want data

data

© Herlihy-Shavit 2007

Page 22: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Multiprocessor Caches: Summary

• Simultaneous reads and writes of shared data:– Make data invalid

• Invalidation is bad for performance• On next data request:

– Data must be fetched from another cache• This slows down performance

Page 23: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

What This Has to Do with TAS Locks

• Recall that TAS lock had bad performance

• Invalidations were the cause

TAS lock

IdealTota

l tim

e

Number of threads

• Here is why: • All spinners do load/store in a loop• They all read/write the same location• Cause lots of invalidations

Page 24: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

A Solution: Test-And-Test-And-Set Lock

• Wait until lock “looks” free– Spin on local cache– No bus use while lock busy

class TTASlock {

AtomicBoolean state = new AtomicBoolean(false);

void lock() {while (true) {

while (state.get()) {} if (!state.getAndSet(true)) return;

}}

Wait until the lock looks free. We read the lock instead of TASing it. We avoid repeated invalidations.

Now try to acquire it.

© Herlihy-Shavit 2007

Page 25: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TTAS Lock Performance

TAS lock

TTAS lock

Ideal Better, but still far from ideal

Tota

l tim

e

Number of threads

© Herlihy-Shavit 2007

Page 26: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

The Problem with TTAS Lock

• When the lock is released:– Everyone tries to acquire is– Everyone does TAS– There is a storm of invalidations

• Only one processor can use the bus at a time• So all processors queue up, waiting for the

bus, so they can perform the TAS

Page 27: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

A Solution: TTAS Lock with Backoff

• Intuition: If I fail to get the lock there must be contention

• So I should back off before trying again• Introduce a random “sleep” delay before

trying to acquire the lock again

© Herlihy-Shavit 2007

Page 28: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

TTAS Lock with Backoff: Performance

TAS lock

TTAS lockBackoff lockIdealTo

tal ti

me

Number of threads

© Herlihy-Shavit 2007

Page 29: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Backoff Locks

• Better performance than TAS and TTAS• Caveats:

– Performance is sensitive to the choice of delay parameter

– The delay parameter depends on the number of processors and their speed

– Easy to tune for one platform– Difficult to write an implementation that will work

well across multiple platforms© Herlihy-Shavit 2007

Page 30: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

An Idea

• Avoid useless invalidations– By keeping a queue of threads

• Each thread– Notifies next in line– Without bothering the others

© Herlihy-Shavit 2007

Page 31: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Anderson Queue Lock

flags

next

T F F F F F F F

idle

locations on which thread spin, one per thread

Points to the next unused

“spin” location acquiring getAndIncrement:atomically get value of“next”, and increment “next”pointer

If “next” was TRUE, lock is acquired

© Herlihy-Shavit 2007

Page 32: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Acquiring a Held Lock

flags

next

T F F F F F F F

acquired acquiring

getAndIncrement

T

released acquired

© Herlihy-Shavit 2007

Page 33: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Anderson Lock: Performance

TAS lock

TTAS lock

IdealAnderson lock

Tota

l tim

e

Number of threads

Almost ideal. We avoid all unnecessary invalidations.Portable – no tunable parameters.

© Herlihy-Shavit 2007

Page 34: SYNAR Systems Networking and Architecture Group CMPT 886: The Art of Scalable Synchronization Dr. Alexandra Fedorova School of Computing Science SFU

SYNAR Systems Networking and Architecture Group

Scalable Synchronization: Summary• Making synchronization primitives scalable is tricky• Performance tied to the hardware architecture• We looked at these spinlocks:

– TAS – poor performance due to invalidations– TTAS – avoids constant invalidations, but a storm of invalidations on lock

release– TTAS with backoff – eliminates the storm of invalidations on release– Anderson Queue Lock – completely eliminates all useless invalidations

• One could think of other optimizations…• For more information, look at the references in the syllabus