Page 1

Multiprocessors continued
Quad-core Kentsfield package
IBM's Power7 with eight cores and 32 Mbytes eDRAM

Page 2

Quick overview

• Why do we have multi-processors?

• What types of programs will work well? What types work poorly?

Page 3

Overview of today’s lecture

• Performance expectations
  – Amdahl's law

• Review
  – Shared memory vs. message passing
  – Interconnection networks
  – Thread-level parallelism

• Context
  – Bandwidth

• Coherence
  – Snoopy-bus coherence protocols (done in detail)
  – Directory-based coherence protocols (done quickly)

Page 4

Let’s start at the top: Performance Expectations

• Amdahl's Law:
  – If a fraction P of your program can be sped up by a factor of S, then your overall speedup is:

        Speedup = 1 / ((1 − P) + P/S)

  – So if P = 0.3 and S = 2 we get 1/(0.7 + 0.15) = 1/0.85, which is about 1.17.
• Question:

– If you have a program in which 10% can’t be done in parallel (i.e. must be serial) and you’ve got 64 processors, what’s the best speed up you could hope for?
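To check the arithmetic, here is a minimal C sketch (the amdahl() helper is my own, not from the slides):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - P) + P/S) */
    static double amdahl(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void) {
        printf("P=0.3, S=2:  %.3f\n", amdahl(0.3, 2.0));   /* 1.176, the ~1.17 above */
        printf("P=0.9, S=64: %.3f\n", amdahl(0.9, 64.0));  /* 8.767 */
        return 0;
    }

So with 10% serial work, 64 processors give at most about an 8.8x speedup, and no processor count can do better than 10x.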


Page 5

Review: Thread-Level Parallelism

• Thread-level parallelism (TLP)
  – Collection of asynchronous tasks: not started and stopped together
  – Data shared loosely, dynamically

• Example: database/web server (each query is a thread)
  – accts is shared, can’t register allocate¹ even if it were scalar
  – id and amt are private variables, register allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...


¹ The corresponding keyword in C is volatile.

Page 6

Thread level parallelism (TLP)

• We exploited parallelism at the instruction level (ILP)
  – Hardware can’t find TLP.
    • The compiler/programmer needs to find it (we lack the window size)
    • Compilers are getting better at this.
  – But hardware can take advantage of TLP
    • Run on multiple cores
    • If data is shared, provide rules to the software about what it can assume about shared data.

Page 7

Review: Why Shared Memory?

Pluses:
  – For applications, it looks like a multitasking uniprocessor
  – For the OS, only evolutionary extensions are required
  – Easy to do communication without the OS
  – Software can worry about correctness first, then performance

Minuses:
  – Proper synchronization is complex
  – Communication is implicit, so it is harder to optimize
  – Hardware designers must implement it

Result:
  – Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever
  – And the first with multi-billion-dollar markets

Page 8

Alternative to shared memory?

• Explicit message passing
  – Rather difficult to code
• We’ll not be doing anything with it
  – But while not a 470 topic, it is important.
  – A graduate-level parallel programming class covers this in detail.

Page 9

But shared memory systems have their own problems

• Two in particular
  – Cache coherence
  – Memory consistency model
• Different solutions depending on the interconnect
  – So let’s do interconnect now, and jump back to these two later.

Page 10

Review: Interconnect

• There are many ways to connect processors.
  – Shared network
    • Bus-based
    • Crossbar
  – Point-to-point
    • More topologies than I want to think about
      – Mesh (in any number of dimensions)
      – Torus (in any number of dimensions)
      – nCube
      – Tree
      – etc., etc.

Page 11

Shared vs. Point-to-Point Networks
• Shared network: e.g., bus (left)
  + Low latency
  – Low bandwidth: doesn’t scale beyond ~8 processors
  + Shared property simplifies cache coherence protocols (later)
• Point-to-point network: e.g., mesh or ring (right)
  – Longer latency: may need multiple “hops” to communicate
  + Higher bandwidth: scales to 1000s of processors
  – Cache coherence protocols are complex

[Figure: left, four CPU($)+Mem nodes on a shared bus; right, four CPU($)+Mem nodes, each with a router (R), connected point-to-point in a ring.]

Page 12

Organizing Point-To-Point Networks
• Network topology: organization of the network
  – Trades off performance (connectivity, latency, bandwidth) against cost
• Router chips
  – Networks that require separate router chips are indirect
  – Networks that use processor/memory/router packages are direct
    + Fewer components, “glueless MP”
• Point-to-point network examples
  – Indirect tree (left)
  – Direct mesh or ring (right)

[Figure: left, an indirect tree of routers (R) with CPU($)+Mem nodes at the leaves; right, a direct ring in which each node packages CPU($), Mem, and a router (R).]

Page 13

Implementation #1: Snooping Bus MP

• Two basic implementations
• Bus-based systems
  – Typically small: 2–8 (maybe 16) processors
  – Typically processors split from memories (UMA)
    • Sometimes multiple processors on a single chip (CMP)
    • Symmetric multiprocessors (SMPs)
    • Common; I use one every day

[Figure: four CPU($) nodes and four memories sharing one bus.]

Page 14

Implementation #2: Scalable MP

• General point-to-point network-based systems
  – Typically processor/memory/router blocks (NUMA)
    • Glueless MP: no need for additional “glue” chips
  – Can be arbitrarily large: 1000’s of processors
    • Massively parallel processors (MPPs)

[Figure: four CPU($)+Mem+router (R) nodes connected point-to-point in a ring.]

Page 15

Context: Bandwidth

• As in the uniprocessor, we need caches
  – To reduce average memory latency.
• But recall caches are also important for reducing bandwidth.
  – And now we’ve got (say) 4x as many requests
  – We’ve probably got a bandwidth problem.
• It’s important that caches work well!

[Figure: four CPU($)+Mem nodes connected through routers; every node’s requests compete for interconnect and memory bandwidth.]

Page 16

Next: Shared data

• It’s easy without a cache!
  – Just make all accesses go to memory.
  – But we need a cache, both for latency and bandwidth reasons.
• With caches, two different processors might see an incoherent view of the same memory location.
  – Let’s quickly see why.

Page 17

An Example Execution

• Two $100 withdrawals from account #241 at two ATMs
  – Each transaction maps to a thread on a different processor
  – Track accts[241].bal (address is in r3)

Processor 0
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

Processor 1
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

[Figure: CPU0 and CPU1 sharing Mem.]

Page 18

No-Cache, No-Problem

• Scenario I: processors have no caches
  – No problem

Processor 0
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

Processor 1
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

[Memory trace of accts[241].bal: starts at 500; P0 loads 500 and stores 400; P1 then loads 400 and stores 300. Final balance: 300, as expected.]

Page 19

Cache Incoherence

• Scenario II: processors have write-back caches
  – Potentially 3 copies of accts[241].bal: memory, P0$, P1$
  – Can get incoherent (inconsistent)

Processor 0
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

Processor 1
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

[Trace (P0$ / Mem / P1$): initially –/500/–. P0 loads: V:500/500/–. P0 stores: D:400/500/–. P1 loads stale data from memory: D:400/500/V:500. P1 stores: D:400/500/D:400. Both caches now say 400; one $100 withdrawal is lost.]

Page 20

Hardware Cache Coherence

• Coherence controller:
  – Examines bus traffic (addresses and data)
  – Executes the coherence protocol
    • What to do with the local copy when you see different things happening on the bus

[Figure: CPU above a D$ (tags + data); a coherence controller (CC) connects the cache to the bus and snoops it.]

Page 21

Snooping Cache-Coherence Protocols

• The bus provides a serialization point
• Each cache controller “snoops” all bus transactions
  – takes action to ensure coherence
    • invalidate
    • update
    • supply value
  – the action depends on the state of the block and the protocol

Page 22

Snooping Design Choices

[Figure: the processor issues ld/st to the cache (rows of State/Tag/Data); the controller also watches snooped (observed) bus transactions.]

• Controller updates the state of blocks in response to processor and snoop events, and generates bus transactions
• Often have duplicate cache tags
• Snoopy protocol
  – set of states
  – state-transition diagram
  – actions
• Basic choices
  – write-through vs. write-back
  – invalidate vs. update

Page 23

Simple bus-based scheme

• Write-through / no-write-allocate with an invalidate protocol.
  – Either have the data or don’t (Valid or Invalid)
• The bus just supports read and write.
• When anyone else does a write, we go to I.
• Could go with an update protocol.
  – How would that be different?
• What are the 4C’s of caching now?

Page 24

The Simple Invalidate Snooping Protocol

• Write-through, no-write-allocate cache

• Inputs: PrRd, PrWr, BusRd, BusWr

• Outputs: BusRd, BusWr

• Yes, that’s confusing!

[State diagram, as a table:]
  Invalid, PrRd  → issue BusRd; go to Valid
  Invalid, PrWr  → issue BusWr; stay in Invalid (no-write-allocate)
  Valid,   PrRd  → no bus transaction (PrRd / --); stay in Valid
  Valid,   PrWr  → issue BusWr (write-through); stay in Valid
  Valid,   BusWr → another processor wrote the block; go to Invalid
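The same machine written out in C, as a minimal sketch (my transcription of the diagram above, with invented names):

    /* Valid/Invalid snooping protocol for a write-through,
     * no-write-allocate cache: one line's controller logic. */
    typedef enum { INVALID, VALID } state_t;
    typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } event_t;
    typedef enum { NO_ACTION, ISSUE_BUS_RD, ISSUE_BUS_WR } action_t;

    action_t step(state_t *s, event_t e) {
        if (*s == INVALID) {
            if (e == PR_RD) { *s = VALID; return ISSUE_BUS_RD; }  /* read miss */
            if (e == PR_WR) return ISSUE_BUS_WR;   /* no-write-allocate: stay I */
        } else { /* VALID */
            if (e == PR_WR) return ISSUE_BUS_WR;   /* write-through */
            if (e == BUS_WR) *s = INVALID;         /* someone else wrote: invalidate */
            /* PR_RD hits silently; BUS_RD needs no action (memory supplies data) */
        }
        return NO_ACTION;
    }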

Page 25

© Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar, Wenisch

Example time

Processor 1 (Transaction, Cache State) | Processor 2 (Transaction, Cache State) | Bus

Accesses to block A, in order (fill in the states and bus transactions in class):
  Read A, Read A, Read A, Write A, Read A, Write A, Write A

Actions: PrRd, PrWr; bus transactions: BusRd, BusWr

Page 26

More Generally: MOESI

[Sweazey & Smith ISCA86]

M - Modified (dirty)

O - Owned (dirty but shared) WHY?

E - Exclusive (clean unshared) only copy, not dirty

S - Shared

I - Invalid

Variants: MSI, MESI, MOSI, MOESI

[Venn diagram: validity covers {M, O, E, S}; ownership covers {M, O}; exclusiveness covers {M, E}; I lies outside all three.]

Page 27

MESI

• Write-back / write-allocate / invalidate
• A cache line is in one of 4 states
  – Modified (dirty)
  – Exclusive (clean unshared) only copy, not dirty
  – Shared
  – Invalid
• Can issue 4 transactions on the bus (Read, Write, Read and Invalidate, Invalidate).
• Each processor snoops its own cache and checks to see if it has the data.
  – If so, it announces whether it has it clean (HIT) or dirty (HITM)
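To make those transitions concrete, a minimal sketch in C (my own summary of the protocol just described, not the lecture’s FSM; shared_hit stands in for another cache asserting HIT/HITM):

    typedef enum { M, E, S, I } mesi_t;

    /* Processor read: a miss issues BRL; the line loads Exclusive if no
     * other cache asserted HIT/HITM on the bus, Shared otherwise. */
    mesi_t proc_read(mesi_t st, int shared_hit) {
        return (st == I) ? (shared_hit ? S : E) : st;  /* M/E/S read hits are silent */
    }

    /* Processor write: the line always ends up Modified; what varies is
     * the bus transaction needed first. */
    const char *proc_write(mesi_t *st) {
        mesi_t old = *st;
        *st = M;
        if (old == I) return "BRIL";  /* fetch the line and invalidate other copies */
        if (old == S) return "BIL";   /* invalidate other copies */
        return "none";                /* E: silent upgrade; M: plain write hit */
    }

    /* Snooped BRL from another processor: M supplies dirty data (HITM)
     * and writes back; M and E both drop to Shared. */
    mesi_t snoop_brl(mesi_t st) { return (st == M || st == E) ? S : st; }

    /* Snooped BRIL/BIL: someone else wants exclusive access, so we
     * invalidate (an M-state cache supplies the dirty line first). */
    mesi_t snoop_bril_bil(mesi_t st) { (void)st; return I; }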

Page 28

MESI example

• M - Modified (dirty)
• E - Exclusive (clean unshared) only copy, not dirty
• S - Shared
• I - Invalid

Processor 1 (Transaction, Cache State) | Processor 2 (Transaction, Cache State) | Bus

Accesses to block A, in order (fill in the states and bus transactions in class):
  Read A, Read A, Read A, Write A, Read A, Write A, Write A

Actions:
• PrRd, PrWr
• BRL – Bus Read Line (BusRd)
• BWL – Bus Write Line (BusWr)
• BRIL – Bus Read and Invalidate
• BIL – Bus Invalidate Line

Page 29

A “wrong” FSM for MESI

It is actually not so much wrong as different (well, some of both)

Page 30

Let’s draw a better one.

Page 31

Two processors, each with a 2-line direct-mapped cache; each line is 16 bytes. The available bus transactions are R, W, R&I, and I. Fill in the tables.

Processor  Address  Read/Write  Bus transaction(s)  Hit/Miss  HIT/HITM  “4C” miss type (if any)
1          0x100    Read
1          0x100    Write
1          0x210    Read
1          0x220    Read
2          0x210    Read
2          0x220    Write
1          0x220    Read

Cache contents (Address and State per set):
         Proc 1           Proc 2
Set 0
Set 1

Page 32

Scalability problems of Snoopy Coherence

• Prohibitive bus bandwidth
  – Required bandwidth grows with # CPUs...
  – ...but available bandwidth per bus is fixed
  – Adding busses makes serialization/ordering hard
• Prohibitive processor snooping bandwidth
  – All caches do a tag lookup when ANY processor accesses memory
  – Inclusion limits this to the L2, but that’s still lots of lookups
• Upshot: bus-based coherence doesn’t scale beyond 8–16 CPUs

Page 33

Scalable Cache Coherence
• Scalable cache coherence: a two-part solution
• Part I: bus bandwidth
  – Replace the non-scalable bandwidth substrate (bus)...
  – ...with a scalable-bandwidth one (point-to-point network, e.g., mesh)
• Part II: processor snooping bandwidth
  – Interesting: most snoops result in no action
  – Replace the non-scalable broadcast protocol (spam everyone)...
  – ...with a scalable directory protocol (only spam processors that care)

Page 34

Directory Coherence Protocols

• Observe: the physical address space is statically partitioned
  + Can easily determine which memory module holds a given line
    – That memory module is sometimes called the line’s “home”
  – Can’t easily determine which processors have the line in their caches
  – Bus-based protocol: broadcast events to all processors/caches
    ± Simple and fast, but non-scalable

• Directories: a non-broadcast coherence protocol
  – Extend memory to track caching information
  – For each physical cache line whose home this is, track:
    • Owner: which processor has a dirty copy (i.e., M state)
    • Sharers: which processors have clean copies (i.e., S state)
  – A processor sends coherence events to the home directory
  – The home directory only sends events to processors that care
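A minimal sketch of what a directory entry and the home node’s read handling could look like (names and layout are my illustration, not the lecture’s):

    #include <stdint.h>

    enum dir_state { UNCACHED, SHARED, MODIFIED };

    struct dir_entry {            /* one per cache line whose home is here */
        enum dir_state state;
        uint64_t sharers;         /* bit i set => processor i has a clean copy */
        int owner;                /* meaningful only in MODIFIED */
    };

    /* Home node processing "Read A" from processor `req`. Message sends
     * are left as comments; they would traverse the network. */
    void home_read(struct dir_entry *e, int req) {
        if (e->state == MODIFIED) {
            /* 3-hop case: recall the dirty line from the owner first. */
            /* send to e->owner: "give up your dirty copy of A" */
            e->sharers = (uint64_t)1 << e->owner;
            e->owner = -1;
        }
        /* send to req: "Fill A" with up-to-date data */
        e->sharers |= (uint64_t)1 << req;
        e->state = SHARED;
    }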

Page 35

Read Processing

Node #1 misses on a read of A:
  Node #1 → Directory: Read A
  Directory → Node #1: Fill A
  Directory state afterward: A: Shared, sharers = {#1}

Page 36

Write Processing

First, as before, Node #1 misses on a read of A:
  Node #1 → Directory: Read A
  Directory → Node #1: Fill A (A: Shared, {#1})
Then Node #2 writes A:
  Node #2 → Directory: ReadExclusive A
  Directory → Node #1: Invalidate A
  Node #1 → Directory: Inv Ack A
  Directory → Node #2: Fill A
  Directory state afterward: A: Modified, owner = #2

Trade-offs:
• Longer accesses (3-hop: processor, directory, other processor)
• Lower bandwidth: no snoops necessary
• Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP with > 32 nodes)

Page 37

Serialization and Ordering

Let A and flag initially be 0

P1            P2
A += 5        while (flag == 0) { }
flag = 1      print A

Assume A and flag are in different cache blocks.

What happens?

How do you implement it correctly?
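One correct implementation, sketched here with C11 atomics (my choice of mechanism; the deck’s own answer comes via the consistency models on the next slides): make the flag store a release and the flag load an acquire.

    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;                /* ordinary shared data */
    atomic_int flag = 0;      /* the synchronization variable */

    void p1(void) {
        A += 5;
        /* release: all earlier writes (A += 5) become visible
         * before flag appears as 1 to an acquiring reader */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void p2(void) {
        /* acquire: once we see flag == 1, we also see A == 5 */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;  /* spin */
        printf("%d\n", A);  /* prints 5 */
    }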

Page 38

Coherence vs. Consistency

Intuition says loads should return the latest value; but what is “latest”?

Coherence concerns only one memory location

Consistency concerns apparent ordering for all locations

A memory system is coherent if we can serialize all operations to a location such that:
  – operations performed by any processor appear in program order
    (program order = the order defined by the program text or assembly code)
  – the value returned by a read is the value written by the last store to that location

Page 39

Why Coherence != Consistency

/* initial A = B = flag = 0 */

P1             P2
A = 1;         while (flag == 0); /* spin */
B = 1;         print A;
flag = 1;      print B;

Intuition says printed A = B = 1

Coherence doesn’t say anything. Why?

Your uniprocessor ordering mechanisms (the ld/st queue) hurt here. Why?

Page 40

Sequential Consistency (SC)

[Figure: processors P1, P2, P3 issue memory ops in program order; a switch, randomly set after each memory op, connects one processor at a time to Memory, providing a single sequential order among all operations.]

Page 41

Sufficient Conditions for SC

Every proc. issues memory ops in program order

Memory ops happen (start and end) atomically
  – must wait for a store to complete before issuing the next memory op
  – after a load, the issuing processor waits for the load to complete before issuing the next op

Easily implemented with a shared bus

Page 42

Relaxed Memory Models

Motivation with directory protocols
  – Misses have longer latency (do cache hits to get to the next miss)
  – Collecting acknowledgements can take even longer

Recall that SC has
  – Each processor generate a total order of its reads and writes (R→R, R→W, W→W, and W→R)...
  – ...that are interleaved into a global total order

Example relaxed models
  – PC: relax ordering from writes to (other processors’) reads
  – RC: relax all read/write orderings (but add “fences”)

Page 43

Locks

Page 44

Lock-based Mutual Exclusion

[Timeline: an acquire starts, the lock transfers (xfer), the critical section runs, then release; other processors wait meanwhile, and the next xfer / critical section / release follows. The synchronization period runs from one handoff to the next.]

No contention:
• Want low latency

Contention:
• Want a low (synchronization) period
• Low traffic
• Fairness

Page 45

How Not to Implement Locks

• LOCK:
    while (lock_variable == 1)
        ;                    /* spin */
    /* ← a context switch here breaks it: two threads can both see 0 and both set 1 */
    lock_variable = 1;

• UNLOCK:
    lock_variable = 0;

Page 46

Solution: Atomic Read-Modify-Write

• Test&Set(r,x): { r = m[x]; m[x] = 1; }
• Fetch&Op(r1,r2,x,op): { r1 = m[x]; m[x] = op(r1,r2); }
• Swap(r,x): { temp = m[x]; m[x] = r; r = temp; }
• Compare&Swap(r1,r2,x): { temp = r2; r2 = m[x]; if (r1 == r2) then m[x] = temp; }
• r is a register; m[x] is memory location x
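For reference, these correspond to real operations; a sketch of the same primitives in C11 atomics (the mapping is mine):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int m;  /* plays the role of m[x] */

    int  test_and_set(void)    { return atomic_exchange(&m, 1); }    /* Test&Set */
    int  fetch_and_add(int r2) { return atomic_fetch_add(&m, r2); }  /* Fetch&Op, op = + */
    int  swap(int r)           { return atomic_exchange(&m, r); }    /* Swap */

    bool compare_and_swap(int r1, int r2) {
        /* succeeds (and stores r2) only if m still holds r1 */
        return atomic_compare_exchange_strong(&m, &r1, r2);
    }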

Page 47

Write a lock and unlock with test-and-set

• Lock:

• unlock:
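One possible answer, as a sketch built on the Test&Set primitive from the previous slide (assumed to return the lock variable’s old value; not official solution code):

    /* Test&Set must be a single atomic read-modify-write in hardware;
     * it is written out here only to pin down its meaning. */
    int test_and_set(int *x) { int old = *x; *x = 1; return old; }

    void lock(int *lock_variable) {
        while (test_and_set(lock_variable) == 1)
            ;  /* lock was already held: spin and retry */
    }

    void unlock(int *lock_variable) {
        *lock_variable = 0;  /* a plain store releases the lock */
    }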

Page 48

Implementing RMWs

• Bus-based systems:
  – Hold the bus and issue load/store operations without any intervening accesses by other processors
• Scalable systems:
  – Acquire exclusive ownership via cache coherence
  – Perform load/store operations without allowing external coherence requests

Page 49

Load-Locked Store-Conditional

• Load-locked
  – Issues a normal load...
  – ...and sets a flag and an address field
• Store-conditional
  – Checks that the flag is set and the address matches
  – If so, performs the store and sets the store register to 1; otherwise sets the store register to 0
• The flag is cleared by
  – Invalidation
  – Cache eviction
  – Context switch

lock:   lda  r2, #1
        ll   r1, lock_variable
        sc   lock_variable, r2
        beqz r2, lock        // SC failed: retry
        bnez r1, lock        // branch if not equal to zero: lock was held, retry

unlock: st   lock_variable, #0

Page 50

Test-and-Set Spin Lock (T&S)

• Lock is “acquire”, Unlock is “release”
• acquire(lock_ptr):
    while (true):
      // Perform “test-and-set”
      old = compare_and_swap(lock_ptr, UNLOCKED, LOCKED)
      if (old == UNLOCKED):
        break  // lock acquired!
      // keep spinning: back to the top of the while loop
• release(lock_ptr):
    store[lock_ptr] <- UNLOCKED

• Performance problem
  – CAS is both a read and a write; spinning causes lots of invalidations
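A common fix, sketched below, is test-and-test-and-set (my addition, not from this slide): spin on plain loads, which hit in the local cache, and only attempt the CAS when the lock looks free.

    #include <stdatomic.h>

    enum { UNLOCKED = 0, LOCKED = 1 };

    /* Test-and-test-and-set: waiting processors spin on reads (local
     * cache hits, no bus traffic) instead of hammering CAS. */
    void acquire(atomic_int *lock_ptr) {
        for (;;) {
            while (atomic_load(lock_ptr) == LOCKED)
                ;  /* spin locally while the lock is held */
            int expected = UNLOCKED;
            if (atomic_compare_exchange_strong(lock_ptr, &expected, LOCKED))
                break;  /* we flipped UNLOCKED -> LOCKED: lock acquired */
        }
    }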
