Faculty of Computer Science Institute for System Architecture, Operating Systems Group Distributed Operating Systems Cache Coherence SS2014 Marcus Völp (slides Julian Stecklina, Marcus Völp)

Source: os.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014


Faculty of Computer Science Institute for System Architecture, Operating Systems Group

Distributed Operating Systems Cache Coherence

SS2014

Marcus Völp (slides Julian Stecklina, Marcus Völp)


Concurrent programs

Global variables:

    int i; int k;

Two threads run concurrently (||):

    i = 1;                    ||    i = i + 1;
    if (i > 1) k = 3;         ||    if (k == 0) k = 4;

    mov $1, [%i]              ||    lock; inc [%i]
    cmp [%i], $1              ||    cmp [%k], $0
    jgt end                   ||    jne end
    mov $3, [%k]              ||    mov $4, [%k]
    end:                      ||    end:
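To see why the result of such a program depends on timing, the two statement sequences on either side of || can be enumerated over all of their interleavings. The following is a minimal sketch, assuming that each C statement executes atomically and that i and k start at 0; the names `step_left`, `step_right`, `run_interleaving` and `count_outcomes` are illustrative and not part of the slides.

```c
#include <assert.h>

/* Shared variables of the two-thread program; both start at 0. */
struct state { int i, k; };

/* Statement `pc` (0 or 1) of the left thread: i = 1; if (i > 1) k = 3; */
static void step_left(struct state *s, int pc) {
    if (pc == 0) s->i = 1;
    else if (s->i > 1) s->k = 3;
}

/* Statement `pc` (0 or 1) of the right thread: i = i + 1; if (k == 0) k = 4; */
static void step_right(struct state *s, int pc) {
    if (pc == 0) s->i = s->i + 1;
    else if (s->k == 0) s->k = 4;
}

/* Run one interleaving. Bit n of `order` set: slot n runs the left thread.
 * `order` must have exactly two of its low four bits set. */
static struct state run_interleaving(unsigned order) {
    struct state s = {0, 0};
    int left_pc = 0, right_pc = 0;
    for (int slot = 0; slot < 4; slot++) {
        if (order & (1u << slot)) step_left(&s, left_pc++);
        else step_right(&s, right_pc++);
    }
    return s;
}

/* Count distinct final (i, k) pairs over all six statement interleavings. */
static int count_outcomes(void) {
    int seen[3][5] = {{0}};  /* i ends in 0..2, k ends in 0..4 */
    int distinct = 0;
    for (unsigned order = 0; order < 16; order++) {
        int bits = 0;
        for (int b = 0; b < 4; b++) bits += (order >> b) & 1;
        if (bits != 2) continue;  /* each thread runs exactly two statements */
        struct state s = run_interleaving(order);
        assert(s.i >= 0 && s.i < 3 && s.k >= 0 && s.k < 5);
        if (!seen[s.i][s.k]) { seen[s.i][s.k] = 1; distinct++; }
    }
    return distinct;
}
```

Under these assumptions the six interleavings end in three different states: (i, k) = (2, 4), (2, 3) and (1, 4).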


Symmetric Multiprocessor (SMP)

[Figure: four processors (CPU0 to CPU3), each with local L1 and L2 caches, connected through a bus or crossbar to shared memory.]

Chip Multiprocessor (CMP), Multicore

[Figure: four processors (CPU0 to CPU3), each with a local L1 cache, with L2 as shared last-level (LL) cache, connected through a bus or crossbar to shared memory.]

Simultaneous Multithreading (SMT), Hyperthreading

[Figure: eight hardware threads (HT0 to HT7) with local L1 caches and L2 as shared last-level (LL) cache, connected through a bus or crossbar to shared memory.]

Non-Uniform Memory Access (NUMA)

[Figure: four CPUs (CPU0 to CPU3), each with its own memory, connected by a general interconnect.]

NUMA Example: Tilera Gx

Source: http://tilera.com


Summary: Memory Organization

• Multiple processors share memory

• Memory access paths through one or more controllers

– UMA (Uniform Memory Access)

– NUMA (Non-Uniform Memory Access)

• Caches / store buffers hold memory content near accessing CPUs.


Cache Coherency

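[Figure: cache lookup for hardware threads HT0 to HT3 with L1 and L2 caches in front of memory. The address is split into tag, index (idx) and offset (ofs); the index selects a set, and the tags read from tag RAM are compared (=) with the address tag to select the matching line in data RAM.]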
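The tag / idx / ofs split of an address and the tag comparison within a set can be sketched as follows. The geometry (64-byte lines, 64 sets) and the names `split_address`, `lookup` and `struct way` are assumptions made for illustration; the slides do not fix any cache parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry (not from the slides): 64-byte lines, 64 sets. */
enum { OFS_BITS = 6, IDX_BITS = 6 };

struct addr_parts { uint64_t tag; unsigned idx; unsigned ofs; };

/* Split an address into tag, set index, and byte offset within the line. */
static struct addr_parts split_address(uint64_t addr) {
    struct addr_parts p;
    p.ofs = (unsigned)(addr & ((1u << OFS_BITS) - 1));
    p.idx = (unsigned)((addr >> OFS_BITS) & ((1u << IDX_BITS) - 1));
    p.tag = addr >> (OFS_BITS + IDX_BITS);
    return p;
}

/* One way of a set: tag RAM entry plus valid bit (data RAM omitted). */
struct way { int valid; uint64_t tag; };

/* Compare the address tag against every way of the selected set.
 * Returns the index of the hitting way, or -1 on a miss. */
static int lookup(const struct way *set, int ways, uint64_t tag) {
    assert(ways > 0);
    for (int w = 0; w < ways; w++)
        if (set[w].valid && set[w].tag == tag) return w;
    return -1;
}
```

For the address 0x12345 this geometry gives ofs = 5, idx = 13 and tag = 0x12.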


Cache Coherency

• Caches lead to multiple copies of the content of a single memory location

• Cache coherency keeps copies “consistent”
  – locate all copies
  – invalidate/update content

• Write Propagation: writes must eventually become visible to all processors.

• Write Serialization: every processor should see writes to the same location in the same order.


Alternative Definition: SWMR

Single-Writer, Multiple-Reader Invariant: For any memory location A, at any given time, either only a single core may write (or read-modify-write) the content of A, or any number of cores may read the content of A.

Data-Value Invariant: The value of a memory location at the start of an operation is the same as the value at the end of its last write (read-modify-write) operation.

[based on Sorin et al., 2011]
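The single-writer, multiple-reader invariant can be stated as a check over the access permissions that the cores hold for one location at one point in time. This is a minimal sketch; the `perm` enumeration and `swmr_holds` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Access permission a core currently holds for one memory location. */
enum perm { PERM_NONE, PERM_READ, PERM_WRITE };

/* Returns 1 if the permissions satisfy SWMR: either only readers,
 * or exactly one writer and no readers. Returns 0 otherwise. */
static int swmr_holds(const enum perm *cores, size_t n) {
    size_t readers = 0, writers = 0;
    assert(cores != NULL || n == 0);
    for (size_t c = 0; c < n; c++) {
        if (cores[c] == PERM_READ) readers++;
        else if (cores[c] == PERM_WRITE) writers++;
    }
    return writers == 0 || (writers == 1 && readers == 0);
}
```

A configuration with one writer and one concurrent reader, or with two writers, fails the check.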


Attempt 1: write through all caches

[Figure: CPU0 and CPU1, each with a write-through (WT) cache, above memory that initially holds x=0.]

CPU0: read x → x=0 stored in cache

CPU1: read x → x=0 stored in cache

CPU0: write x=1 → x=1 stored in cache, x=1 stored in memory

CPU1: read x → x=0 retrieved from cache

Write not visible to CPU1!
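The four steps can be replayed with a toy model of two write-through caches that never invalidate each other. This is only a sketch of the scenario: each cache holds a single copy of x, and the names `wt_cache`, `wt_read`, `wt_write` and `replay_write_through` are made up for the example.

```c
#include <assert.h>

/* One cached copy of x plus a valid bit. */
struct wt_cache { int valid; int x; };

/* Main memory content of x. */
static int wt_memory_x = 0;

/* Read x: a hit returns the cached copy, a miss fills it from memory. */
static int wt_read(struct wt_cache *c) {
    if (!c->valid) { c->x = wt_memory_x; c->valid = 1; }
    return c->x;
}

/* Write-through: update the local copy and memory, but no other cache. */
static void wt_write(struct wt_cache *c, int value) {
    c->x = value;
    c->valid = 1;
    wt_memory_x = value;
}

/* Replays the four steps and returns the value CPU1 reads last. */
static int replay_write_through(void) {
    struct wt_cache cpu0 = {0, 0}, cpu1 = {0, 0};
    wt_memory_x = 0;
    wt_read(&cpu0);              /* CPU0: read x, x=0 stored in cache */
    wt_read(&cpu1);              /* CPU1: read x, x=0 stored in cache */
    wt_write(&cpu0, 1);          /* CPU0: write x=1, cache and memory */
    assert(wt_memory_x == 1);    /* memory is up to date ... */
    return wt_read(&cpu1);       /* ... but CPU1 hits its stale copy */
}
```

`replay_write_through()` returns 0 although memory already holds 1, which is the stale read shown above.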


Attempt 2: write back

[Figure: CPU0 and CPU1, each with a write-back (WB) cache, above memory that initially holds x=0.]

CPU0: read x → x=0 stored in cache

CPU1: read x → x=0 stored in cache

CPU0: write x=1 → x=1 stored in cache

CPU1: write x=2 → x=2 stored in cache

CPU1: writeback → x=2 stored in memory

CPU0: writeback → x=1 stored in memory

Later store x=2 lost!
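The lost store can be reproduced with a toy model of two uncoordinated write-back caches. Again this is a sketch of the scenario only, with one cached copy of x per CPU; `wb_cache`, `wb_read`, `wb_write`, `wb_writeback` and `replay_write_back` are illustrative names.

```c
#include <assert.h>

/* One cached copy of x with valid and dirty bits. */
struct wb_cache { int valid; int dirty; int x; };

/* Main memory content of x. */
static int wb_memory_x = 0;

/* Read x: a hit returns the cached copy, a miss fills it from memory. */
static int wb_read(struct wb_cache *c) {
    if (!c->valid) { c->x = wb_memory_x; c->valid = 1; c->dirty = 0; }
    return c->x;
}

/* Write-back: only the local copy changes; memory is updated later. */
static void wb_write(struct wb_cache *c, int value) {
    c->x = value;
    c->valid = 1;
    c->dirty = 1;
}

/* Write a dirty copy back to memory. */
static void wb_writeback(struct wb_cache *c) {
    if (c->valid && c->dirty) { wb_memory_x = c->x; c->dirty = 0; }
}

/* Replays the six steps and returns the final memory content of x. */
static int replay_write_back(void) {
    struct wb_cache cpu0 = {0, 0, 0}, cpu1 = {0, 0, 0};
    wb_memory_x = 0;
    wb_read(&cpu0);              /* CPU0: read x, x=0 cached */
    wb_read(&cpu1);              /* CPU1: read x, x=0 cached */
    wb_write(&cpu0, 1);          /* CPU0: write x=1, cache only */
    wb_write(&cpu1, 2);          /* CPU1: write x=2, cache only (later store) */
    wb_writeback(&cpu1);         /* memory x=2 */
    assert(wb_memory_x == 2);
    wb_writeback(&cpu0);         /* memory x=1, overwriting the later store */
    return wb_memory_x;
}
```

`replay_write_back()` returns 1: the later store of 2 is overwritten by the earlier store of 1.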


Coherency Problems & Solutions

Both examples violate SWMR!

Problem 1: CPU1 used a stale value that had already been modified by CPU0.

– Solution: Invalidate copies before the write proceeds!

Problem 2: Incorrect writeback order of modified cache lines.

– Solution: Disallow more than one modified copy!


Coherency Protocol Design Space

• Snooping-based
  – All coherency-related traffic is broadcast to all CPUs
  – Each processor snoops and acts accordingly:
    • Invalidate lines written by other CPUs
    • Signal sharing for lines currently in cache
  – Straightforward for bus-based systems
  – Suited for small-scale systems

• Directory-based
  – Uses central directory to track cache line owner
  – Update copies in other caches
    • Can update all CPUs at once (less traffic for alternating reads and writes)
    • Multiple writes need multiple updates (more traffic for subsequent writes)
  – Suited for large-scale systems


Coherency Protocol Design Space

• Snooping-based vs. Directory-based

[Figure: two diagrams, one per approach, each showing CPU0 and CPU1 with L1 caches, an L2 with snoop filter, and memory.]

Invalidation vs. Update Protocols

• Invalidation-based
  – Only write misses hit the bus (suited for WB caches)
  – Subsequent writes are write hits
  – Good for multiple writes to the same cache line by the same CPU

• Update-based
  – All sharers of a cache line continue to hit in the cache after a write by one CPU
  – Otherwise lots of useless updates (wastes bandwidth) → Rarely used!

• Hybrid forms are possible!
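The traffic trade-off between the two families can be made concrete by counting bus transactions for a trace of accesses to a single cache line. The sketch below is a deliberately simplified model with two CPUs, assuming that every miss, upgrade, or update costs exactly one bus transaction; `bus_traffic_invalidate` and `bus_traffic_update` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

enum { NCPU = 2 };
enum line_state { ST_INVALID, ST_SHARED, ST_MODIFIED };

/* One access to the same cache line: which CPU, and read (0) or write (1). */
struct access { int cpu; int is_write; };

/* Invalidation-based: a write invalidates all other copies once;
 * further writes by the same CPU are hits without bus traffic. */
static unsigned bus_traffic_invalidate(const struct access *trace, size_t n) {
    enum line_state state[NCPU] = { ST_INVALID };
    unsigned bus = 0;
    for (size_t t = 0; t < n; t++) {
        int me = trace[t].cpu;
        assert(me >= 0 && me < NCPU);
        if (trace[t].is_write) {
            if (state[me] == ST_MODIFIED) continue;   /* write hit */
            bus++;                                    /* write miss or upgrade */
            for (int c = 0; c < NCPU; c++) state[c] = ST_INVALID;
            state[me] = ST_MODIFIED;
        } else if (state[me] == ST_INVALID) {
            bus++;                                    /* read miss */
            for (int c = 0; c < NCPU; c++)
                if (state[c] == ST_MODIFIED) state[c] = ST_SHARED;
            state[me] = ST_SHARED;
        }
    }
    return bus;
}

/* Update-based: copies stay valid; every write to a line that another
 * cache also holds is sent over the bus as an update. */
static unsigned bus_traffic_update(const struct access *trace, size_t n) {
    int valid[NCPU] = { 0 };
    unsigned bus = 0;
    for (size_t t = 0; t < n; t++) {
        int me = trace[t].cpu;
        assert(me >= 0 && me < NCPU);
        if (!valid[me]) {
            bus++;                                    /* miss: fetch the line */
            valid[me] = 1;
        } else if (trace[t].is_write) {
            for (int c = 0; c < NCPU; c++)
                if (c != me && valid[c]) { bus++; break; }  /* one update */
        }
    }
    return bus;
}
```

In this model, four consecutive writes by CPU0 to a line that both CPUs have read cost 3 transactions with invalidation and 6 with update, while three rounds of "CPU0 writes, CPU1 reads" cost 6 with invalidation and 4 with update.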


A Basic Coherency Protocol: MSI

• Modified (M)
  – No copies in other caches; local copy modified
  – Memory is stale

• Shared (S)
  – Unmodified copies in one or more caches
  – Memory is up-to-date

• Invalid (I)
  – Not in cache

• States are tracked from the view of the cache controller, which sees events from:
  – Local processor → processor transactions
  – Other processors → snoop transactions


MSI: Processor Transitions

• State is I, CPU reads (PrRd)
  – Generate bus read request (BusRd)
  – Go to S

• State is S or M, CPU reads (PrRd)
  – No transition

• State is S, CPU writes (PrWr)
  – Upgrade cache line for exclusive ownership (BusRdX)
  – Go to M

• State is M, CPU writes (PrWr)
  – No transition
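The processor-side rules translate directly into a small transition function. This is a sketch with illustrative names (`msi_on_processor`, `enum bus_req`). The case of a write in state I is not listed in the rules above; it is handled here like the upgrade from S, following the PrWr → BusRdX edge of the MSI state diagram.

```c
#include <assert.h>

enum msi_state { MSI_I, MSI_S, MSI_M };
enum pr_event { PR_RD, PR_WR };
enum bus_req { BUS_NONE, BUS_RD, BUS_RDX };

/* Processor-side MSI transition: updates *state and returns the bus
 * request the cache controller has to issue (BUS_NONE on a hit). */
static enum bus_req msi_on_processor(enum msi_state *state, enum pr_event ev) {
    assert(state != 0);
    if (ev == PR_RD) {
        if (*state == MSI_I) {        /* read miss: fetch, go to S */
            *state = MSI_S;
            return BUS_RD;
        }
        return BUS_NONE;              /* S or M: read hit, no transition */
    }
    if (*state == MSI_M)
        return BUS_NONE;              /* write hit, no transition */
    *state = MSI_M;                   /* S: upgrade; I: write miss */
    return BUS_RDX;
}
```

Starting from I, a read followed by two writes issues BusRd, then BusRdX, then nothing.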


MSI: Snoop Transitions

• Receiving a read snoop (BusRd) for a cache line
  – If M, write cache line back to memory (WB), transition to S
  – If S, no transition

• Receiving an exclusive ownership snoop (BusRdX)
  – If M, write cache line back to memory (WB), discard it, transition to I
  – If S, discard cache line, transition to I
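The snoop side can be sketched in the same way. The function below returns whether the snooped line has to be written back; the names `msi_on_snoop`, `enum msi_line` and `enum snoop` are assumptions made for this example.

```c
#include <assert.h>

enum msi_line { LINE_I, LINE_S, LINE_M };
enum snoop { SNOOP_BUS_RD, SNOOP_BUS_RDX };

/* Snoop-side MSI transition: updates *state and returns 1 if the
 * cache line must be written back to memory (WB), 0 otherwise. */
static int msi_on_snoop(enum msi_line *state, enum snoop ev) {
    assert(state != 0);
    if (*state == LINE_I)
        return 0;                          /* line not cached: nothing to do */
    int writeback = (*state == LINE_M);    /* only M is newer than memory */
    if (ev == SNOOP_BUS_RD)
        *state = LINE_S;                   /* M -> S, S stays S */
    else
        *state = LINE_I;                   /* M -> I, S -> I: discard */
    return writeback;
}
```

Because a BusRdX forces every other copy to I before the requester enters M, at most one modified copy exists at a time, which rules out the writeback-order problem of the plain write-back attempt.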


MSI State Transitions

Processor Transitions
  – I → S: PrRd → BusRd
  – I → M: PrWr → BusRdX
  – S → S: PrRd
  – S → M: PrWr → BusRdX
  – M → M: PrRd, PrWr

Snoop Transitions
  – M → S: BusRd → WB
  – M → I: BusRdX → WB
  – S → S: BusRd
  – S → I: BusRdX
  – I → I: BusRd, BusRdX

Problems in MSI

A common use case is to:
  – Read variable A: line enters S
  – Modify A: BusRdX sent, S → M

The invalidation message is pointless if no other cache holds A.

Solved by adding an Exclusive (E) state:
  – No copies exist in other caches
  – Memory is up-to-date

Variants of MESI are used by most popular microprocessors.


MESI State Transitions

Processor Transitions
  – I → S: PrRd → BusRd (HIT)
  – I → E: PrRd → BusRd (!HIT)
  – I → M, S → M: PrWr → BusRdX
  – E → M, M → M: PrWr
  – S, E, M: PrRd (no transition)

Snoop Transitions
  – M → S: BusRd → WB
  – M → I: BusRdX → WB
  – E → S, S → S: BusRd → HIT
  – E → I, S → I: BusRdX
  – I → I: BusRd, BusRdX
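The benefit of the E state shows up when counting bus transactions for "read A, then modify A". The sketch below models only the processor side of MESI; `hit_signal` stands for the HIT line sampled during BusRd, and all function and type names are illustrative.

```c
#include <assert.h>

enum mesi_state { MESI_I, MESI_S, MESI_E, MESI_M };
enum mesi_bus { MESI_BUS_NONE, MESI_BUS_RD, MESI_BUS_RDX };

/* Processor read. `hit_signal` is nonzero if another cache asserted HIT
 * on the BusRd, i.e. some other cache also holds the line. */
static enum mesi_bus mesi_pr_rd(enum mesi_state *s, int hit_signal) {
    if (*s != MESI_I)
        return MESI_BUS_NONE;                 /* read hit in S, E or M */
    *s = hit_signal ? MESI_S : MESI_E;        /* shared vs. exclusive fill */
    return MESI_BUS_RD;
}

/* Processor write: only I and S need a BusRdX; E upgrades silently. */
static enum mesi_bus mesi_pr_wr(enum mesi_state *s) {
    enum mesi_bus req =
        (*s == MESI_I || *s == MESI_S) ? MESI_BUS_RDX : MESI_BUS_NONE;
    *s = MESI_M;
    return req;
}

/* Bus transactions needed for "read A, then modify A" starting from I. */
static int read_then_modify_bus_count(int other_cache_has_line) {
    enum mesi_state s = MESI_I;
    int n = 0;
    n += mesi_pr_rd(&s, other_cache_has_line) != MESI_BUS_NONE;
    n += mesi_pr_wr(&s) != MESI_BUS_NONE;
    assert(s == MESI_M);
    return n;
}
```

If no other cache holds the line, the sequence needs one bus transaction instead of the two that MSI would issue; if the line is shared, the BusRdX is still required.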


MOESI: Adding Owned to MESI

• Similar to MESI, with some extensions

• Cache-to-cache transfers of modified cache lines
  – Modified cache lines are not written back to memory, but supplied to other CPUs on BusRd
  – The CPU that had the initial modified copy becomes “owner”

• Avoids writeback to memory when another CPU accesses the cache line
  – Beneficial when cache-to-cache latency/bandwidth is better than cache-to-memory latency/bandwidth

• Used by AMD Opteron
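How a MOESI cache reacts to a snooped BusRd can be sketched as follows. Only the BusRd case is modelled, since that is the case described above; the result structure and the names `moesi_snoop_bus_rd` and `struct snoop_result` are assumptions for illustration.

```c
#include <assert.h>

enum moesi_state { MOESI_I, MOESI_S, MOESI_E, MOESI_O, MOESI_M };

struct snoop_result {
    int hit;        /* HIT asserted: this cache holds the line */
    int xfer;       /* line supplied cache-to-cache (XFER) */
    int writeback;  /* line written back to memory (WB) */
};

/* MOESI reaction to a snooped BusRd: a modified line is supplied
 * directly to the requester and stays dirty in state O. */
static struct snoop_result moesi_snoop_bus_rd(enum moesi_state *s) {
    struct snoop_result r = {0, 0, 0};
    assert(s != 0);
    switch (*s) {
    case MOESI_M:               /* keep the dirty data, become owner */
        *s = MOESI_O;
        r.hit = 1;
        r.xfer = 1;
        break;
    case MOESI_O:               /* owner keeps supplying the line */
        r.hit = 1;
        r.xfer = 1;
        break;
    case MOESI_E:               /* clean copy is now shared */
        *s = MOESI_S;
        r.hit = 1;
        break;
    case MOESI_S:
        r.hit = 1;
        break;
    case MOESI_I:               /* line not cached */
        break;
    }
    return r;
}
```

In contrast to MESI, where a BusRd to a modified line triggers a writeback, `writeback` stays 0 in every case here; memory remains stale while the owner holds the line.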


MOESI State Transitions

Processor Transitions
  – I → S: PrRd → BusRd (HIT)
  – I → E: PrRd → BusRd (!HIT)
  – I → M, S → M, O → M: PrWr → BusRdX
  – E → M, M → M: PrWr
  – S, E, O, M: PrRd (no transition)

Snoop Transitions
  – M → O, O → O: BusRd → HIT, XFER
  – E → S, S → S: BusRd → HIT
  – M → I, O → I: BusRdX → WB / BusRdX → XFER
  – E → I, S → I: BusRdX
  – I → I: BusRd, BusRdX

Coherency in Multi-Level Caches

• Bus only connected to last-level cache (e.g. L2)
  – Snoop requests are relevant to inner-level caches (e.g. L1)
  – Modifications in L1 may not be visible to L2 (and the bus)

• Idea: L2 filters and forwards transactions for L1:
  – On BusRd, check if line is M/O in L1 (may be S or E in L2)
  – On BusRdX, send invalidate to L1

• Only easy for inclusive caches!

• Inclusion property: the outer cache contains a superset of the content of its inner caches.
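The inclusion property and the filtering it enables can be sketched over plain lists of cached line addresses. This is a simplified model that ignores sets, ways and states; `holds_line`, `is_inclusive` and `snoop_may_need_l1` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if `line` is among the `n` line addresses in `lines`. */
static int holds_line(const uint64_t *lines, size_t n, uint64_t line) {
    assert(lines != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (lines[i] == line) return 1;
    return 0;
}

/* Inclusion property: every line cached in L1 is also present in L2. */
static int is_inclusive(const uint64_t *l1, size_t n1,
                        const uint64_t *l2, size_t n2) {
    for (size_t i = 0; i < n1; i++)
        if (!holds_line(l2, n2, l1[i])) return 0;
    return 1;
}

/* With an inclusive L2, a snoop can only concern L1 if L2 holds the
 * line; all other snoops are filtered without disturbing L1. */
static int snoop_may_need_l1(const uint64_t *l2, size_t n2, uint64_t line) {
    return holds_line(l2, n2, line);
}
```

Without inclusion, a line could sit in L1 but not in L2, so an L2 miss would no longer prove that L1 can ignore the snoop.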


References

• A Primer on Memory Consistency and Cache Coherence. Sorin, Hill, Wood; 2011

• atomic<> Weapons: The C++ Memory Model and Modern Hardware (Video). Sutter; 2013

• Shared Memory Consistency Models: A Tutorial. Adve, Gharachorloo; 1996

• IA Memory Model. Richard Hudson; Google Tech Talk 2008

• Memory Ordering in Modern Microprocessors. McKenney; Linux Journal 2005

• How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. Lamport; 1979

• PowerPC Storage Model


• Scheduler-Conscious Synchronization. Leonidas Kontothanassis, Robert Wisniewski, Michael Scott

• Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors. John M. Mellor-Crummey, Michael L. Scott

• Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. John Mellor-Crummey, Michael Scott

• Concurrent Update on Multiprogrammed Shared Memory Multiprocessors. Maged M. Michael, Michael L. Scott

• Scalable Queue-Based Spin Locks with Timeout. Michael L. Scott, William N. Scherer III


• Reactive Synchronization Algorithms for Multiprocessors. B. Lim, A. Agarwal

• Lock Free Data Structures. John D. Valois (PhD Thesis)

• Reduction: A Method for Proving Properties of Parallel Programs. R. Lipton; Communications of the ACM 1975

• Decoupling Contention Management from Scheduling. F.R. Johnson, R. Stoica, A. Ailamaki, T. Mowry; ASPLOS 2010

• Corey: An Operating System for Many Cores. Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang; OSDI 2008