
John Mellor-Crummey

Department of Computer Science, Rice University


Cache Coherence Protocols for Chip Multiprocessors - I

COMP 522 Lecture 5, 6 September 2016


Context

• Thus far
  —chip multiprocessors
  —hardware threading strategies
    – simultaneous multithreading
    – fine-grain multithreading
  —future microprocessor issues and trends
• Today: sharing cache in chip multiprocessors
  —cache coherence
  —victim replication

Today’s References

• Chapter 6: Coherence Protocols; Chapter 7: Snooping Coherence Protocols; Chapter 8: Directory Coherence Protocols. In A Primer on Memory Consistency and Cache Coherence. Daniel J. Sorin, Mark D. Hill, David A. Wood. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011.

• Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. M. Zhang and K. Asanovic. In Proceedings of the 32nd International Symposium on Computer Architecture, Madison, WI, June 2005.

A Primer on Caching and Coherence

Cache

• A content-addressable memory used to store data items so that future requests will be served faster
  —reduces average latencies to access storage
• How does a datum become stored in a cache?
  —value written by an earlier computation
  —duplicate of a value available from storage elsewhere
• Results of load/store operations
  —cache hit
    – requested data is present in cache
  —cache miss
    – requested data is not present in cache
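As a rough illustration of hits and misses, the sketch below models a tiny direct-mapped cache. The class name `DirectMappedCache`, the 4-set geometry, and the 64-byte block size are assumptions made for this example and do not come from the lecture; fetching the block from storage is not modeled.

```python
class DirectMappedCache:
    """Toy direct-mapped cache; all sizes are illustrative assumptions."""

    def __init__(self, num_sets=4, block_bytes=64):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.tags = [None] * num_sets   # tag held by each set, None = empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a cache miss."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1              # requested data is present in cache
            return True
        # Miss: the block would be fetched from storage (not modeled here)
        # and replaces whatever the set held before.
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache()
results = [cache.access(a) for a in (0, 8, 64, 0, 1024, 0)]
print(results, cache.hits, cache.misses)
```

Addresses 0 and 1024 map to the same set in this geometry, so the last access to address 0 misses again even though it was cached earlier.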

Consistency vs. Coherence

• Consistency models (aka memory models)
  —define correct shared memory behavior in terms of loads and stores (memory reads and writes), without reference to caches or coherence
  —can stores be seen out of order? if so, under what conditions?
    – sequential consistency vs. weak memory models
• Coherence
  —problems can arise if multiple actors (e.g., multiple cores) have access to multiple copies of a datum (e.g., in multiple caches) and at least one access is a write
    – there must appear to be one and only one value per memory location
  —access to stale data (incoherence) is prevented using a coherence protocol
    – set of rules implemented by the distributed actors within a system

Goal of Coherence Protocols

Maintain coherence by enforcing the following invariants

• Single-Writer, Multiple-Reader (SWMR) Invariant
  —for any memory location A, at any given time, there exists only a single core that may write to A (that core can also read it) or some number of cores that may only read A
• Data-Value Invariant
  —the value of the memory location at the start of an epoch is the same as its value at the end of its last read-write epoch
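The two invariants can be expressed as simple checks. This is a minimal sketch under assumed encodings: per-core permissions are written as "RW", "R", or None, and an epoch is a tuple (kind, start_value, end_value). The function names are hypothetical and chosen only for this example.

```python
def satisfies_swmr(permissions):
    """permissions maps core id -> 'RW', 'R', or None for one memory location."""
    writers = [c for c, p in permissions.items() if p == "RW"]
    readers = [c for c, p in permissions.items() if p == "R"]
    if len(writers) > 1:
        return False      # two cores could write at the same time
    if writers and readers:
        return False      # a writer may not coexist with read-only copies
    return True


def satisfies_data_value(epochs, initial_value):
    """epochs: ordered list of (kind, start_value, end_value); kind is 'RW' or 'RO'."""
    last_written = initial_value
    for kind, start_value, end_value in epochs:
        if start_value != last_written:
            return False  # epoch did not start with the last written value
        if kind == "RO" and end_value != start_value:
            return False  # a read-only epoch cannot change the value
        if kind == "RW":
            last_written = end_value
    return True


read_epoch_ok = satisfies_swmr({0: "R", 1: "R", 2: None})
write_epoch_ok = satisfies_swmr({0: None, 1: "RW", 2: None})
mixed_bad = satisfies_swmr({0: "RW", 1: "R"})
values_ok = satisfies_data_value([("RW", 0, 5), ("RO", 5, 5), ("RW", 5, 7)], 0)
values_stale = satisfies_data_value([("RW", 0, 5), ("RO", 0, 0)], 0)
```

The last case fails because the read-only epoch starts from the stale value 0 rather than the value 5 left by the preceding read-write epoch.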

Implementing Coherence Invariants

• Hardware: typical of systems today
  —each cache and the LLC/memory has an associated finite state machine known as a coherence controller
    – the set of controllers forms a distributed system
  —controllers exchange messages to ensure that, for each block, the SWMR and data-value invariants are maintained at all times
• Software
  —relies on compiler and/or runtime support
    – may or may not have help from the hardware
  —must be conservative to be safe
    – assume the worst about potential memory aliases
  —of increasing interest
    – concerns about cost of coherence in joules
    – scales well for microprocessors based on “tiled” designs

Intel Single-chip Cloud Computer (SCC), 2010


Cache Controller

• Cache controller accepts loads and stores from the core and returns load values to the core

• On a cache miss, a controller initiates a coherence transaction by issuing a coherence request for the block containing the location accessed by the core

• Cache controller listens for and responds to coherence requests from other caches

• Implements a set of finite state machines (logically one per block) and receives and processes events (e.g., incoming coherence messages) depending upon the block’s state

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011.

State Diagram for 4-state Invalidate Protocol

MESI states: Modified, Exclusive, Shared, and Invalid

MESI figure credit: http://sc.tamu.edu/Images/MESI.png (Copyright Michael Thomadakis, Texas A&M 2009-2011)

Table: permissible state pairs for a pair of caches (rows and columns: M, E, S, I)
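Which state pairs two caches may hold for the same block follows from the SWMR invariant: a Modified or Exclusive copy tolerates only Invalid copies elsewhere, while Shared copies may coexist. The sketch below derives that compatibility table from the standard MESI definitions; the function name `compatible` is hypothetical.

```python
from itertools import product

STATES = ("M", "E", "S", "I")


def compatible(a, b):
    """True if two caches may hold the same block in states a and b."""
    if a == "I" or b == "I":
        return True               # an invalid copy conflicts with nothing
    return a == "S" and b == "S"  # otherwise only read-only sharing is allowed


table = {(a, b): compatible(a, b) for a, b in product(STATES, repeat=2)}

# Print the table with one row per state of the first cache.
print("  " + "  ".join(STATES))
for a in STATES:
    print(a, " ".join("ok" if table[(a, b)] else "--" for b in STATES))
```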

Memory Controller

• Memory controller
  —similar to a cache controller
  —listens for and responds to coherence requests from caches
• Has only a network side
  —does not
    – issue coherence requests (on behalf of loads or stores)
    – receive coherence responses

Figure Credit: Daniel J. Sorin, Mark D. Hill, David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture. Morgan & Claypool. 2011.

Snooping Coherence

• Snoopy cache systems
  —broadcast all invalidates and read requests
  —all coherence controllers listen and perform appropriate coherence operations locally


Operation of Snoopy Caches

• Once a datum is tagged modified or exclusive
  —all subsequent operations can be performed locally in cache
  —no external traffic needed
• If a data item is read by a number of processors
  —transitions to the shared state in all caches
  —all subsequent read operations become local
• If multiple processors read and update data
  —generate coherence requests on the bus
  —bus is bandwidth limited: imposes a limit on updates per second
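The behavior described above can be sketched with a toy MESI simulation of a single block on a broadcast bus. The class names and the request names `BusRd` and `BusRdX` are assumptions for this example. Data transfer and write-back are not modeled, and an S-to-M upgrade is simplified to a `BusRdX` broadcast.

```python
class Bus:
    """Broadcast medium: every request is seen by all other caches."""

    def __init__(self):
        self.caches = []
        self.transactions = 0

    def broadcast(self, sender, request):
        self.transactions += 1
        # Build a list so that every cache snoops, then combine the replies.
        replies = [c.snoop(request) for c in self.caches if c is not sender]
        return any(replies)


class SnoopyCache:
    def __init__(self, name, bus):
        self.name = name
        self.state = "I"          # MESI state of the one block modeled here
        self.bus = bus
        bus.caches.append(self)

    def read(self):
        if self.state == "I":     # miss: broadcast a read request
            shared = self.bus.broadcast(self, "BusRd")
            self.state = "S" if shared else "E"
        # In M, E, or S the read is served locally with no bus traffic.

    def write(self):
        if self.state in ("I", "S"):   # must first invalidate other copies
            self.bus.broadcast(self, "BusRdX")
        self.state = "M"               # E -> M needs no bus traffic

    def snoop(self, request):
        """React to another cache's request; return True if we held a copy."""
        had_copy = self.state != "I"
        if request == "BusRd" and had_copy:
            self.state = "S"           # M or E downgrades to shared
        elif request == "BusRdX":
            self.state = "I"           # another cache wants to write
        return had_copy


bus = Bus()
a, b = SnoopyCache("A", bus), SnoopyCache("B", bus)
a.read()     # A: I -> E, one bus transaction
a.write()    # A: E -> M, performed locally
b.read()     # B: I -> S and A: M -> S, second bus transaction
b.read()     # local hit, no traffic
b.write()    # B: S -> M and A: S -> I, third bus transaction
final = (a.state, b.state, bus.transactions)
```

Five accesses generate only three bus transactions here; a workload in which both caches keep writing the block would generate one transaction per ownership change, which is where the bus bandwidth limit bites.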

Directory-based Coherence

• Snooping protocol: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers
• A directory maintains a global view of each block
  —tracks which caches hold each block and in what states
• Directory protocol: a cache controller initiates a request for a block by sending it to the memory controller that is the home for that block
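To contrast with broadcasting, the sketch below shows a directory at a block's home that records the owner and sharers and sends point-to-point messages only to the caches involved. The class, method, and message names are hypothetical; data responses and acknowledgments are left out.

```python
class Directory:
    """Home directory: per block, the exclusive owner (if any) and the sharers."""

    def __init__(self):
        self.owner = {}      # block -> cache id holding the block exclusively
        self.sharers = {}    # block -> set of cache ids holding read-only copies
        self.messages = []   # point-to-point messages sent, in order

    def get_shared(self, block, requester):
        """Handle a read request that arrived at the block's home."""
        owner = self.owner.pop(block, None)
        sharers = self.sharers.setdefault(block, set())
        if owner is not None:
            # Only the current owner is contacted; it keeps a read-only copy.
            self.messages.append(("downgrade", owner, block))
            sharers.add(owner)
        sharers.add(requester)

    def get_modified(self, block, requester):
        """Handle a write request that arrived at the block's home."""
        owner = self.owner.get(block)
        if owner is not None and owner != requester:
            self.messages.append(("invalidate", owner, block))
        for sharer in sorted(self.sharers.pop(block, set()) - {requester}):
            self.messages.append(("invalidate", sharer, block))
        self.owner[block] = requester


directory = Directory()
directory.get_shared(0x40, requester=0)
directory.get_shared(0x40, requester=1)
directory.get_modified(0x40, requester=2)   # invalidates caches 0 and 1 only
directory.get_shared(0x40, requester=0)     # downgrades cache 2 only
print(directory.messages)
```

In a system with many caches, the invalidations above would still go to just two of them, whereas a snooping protocol would broadcast each request to all controllers.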


Intel MESIF Protocol (2005)

• MESIF: Modified, Exclusive, Shared, Invalid and Forward
• If a cache line is shared
  —one shared copy of the cache line is in the F state
  —remaining copies of the cache line are in the S state
• Forward (F) state designates a single copy of data from which further copies can be made
  —cache line in the F state will respond to a request for a copy of the cache line
  —consider how one embodiment of the protocol responds to a read
    – newly created copy is placed in the F state
    – cache line previously in the F state is put in the S or the I state
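The read response described in the last bullet can be sketched as follows. The sketch covers only a shared line with exactly one F copy and picks the S option for the previous forwarder; the function name `read_shared_line` and the dictionary encoding of per-cache states are assumptions for this example.

```python
def read_shared_line(states, requester):
    """states maps cache id -> 'M', 'E', 'S', 'I', or 'F' for one cache line.

    Returns the id of the cache that supplies the copy. Other cases
    (no F copy, or M/E copies) are outside the scope of this sketch.
    """
    forwarders = [c for c, s in states.items() if s == "F"]
    if len(forwarders) != 1:
        raise ValueError("sketch only covers a shared line with one F copy")
    supplier = forwarders[0]
    states[supplier] = "S"    # previous forwarder becomes S (I is also allowed)
    states[requester] = "F"   # newly created copy becomes the forwarder
    return supplier


states = {0: "F", 1: "S", 2: "I", 3: "I"}
first_supplier = read_shared_line(states, requester=2)
second_supplier = read_shared_line(states, requester=3)
print(first_supplier, second_supplier, states)
```

After each read exactly one copy remains in the F state, so a later request is always answered by a single cache rather than by every sharer.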


H. Hum et al. US Patent 6,922,756. July 2005. http://bit.ly/gQNkRR


Dance-Hall Shared Cache CMPs

Niagara 1

• L1 cache co-located with PE
• PEs on far side of interconnect from L2 cache
  —each L2 cache equidistant from all cores

Figure credit: Niagara: A 32-Way Multithreaded SPARC Processor, P. Kongetira, K. Aingaran, and K. Olukotun, IEEE Micro, pp. 21-29, March-April 2005.

Blue Gene/Q’s BGC Chip (2012)

• System on a chip
  —processor, memory, network logic
• 360 mm², 1.47B transistors
• 16 user cores + 1 service core + 1 spare core
  —all cores are symmetric
  —4-way SMT per core
• Shared L2 cache: 32MB eDRAM
  —multi-versioned cache
    – transactional memory, speculative execution, atomic operations
  —latency ~80 cycles
• Dual memory controller
  —16GB external DDR3 memory
  —1.3GB/s
  —2 x 16 byte wide interface (+ECC)
• Chip-to-chip networking
  —integrated router for 5D torus

Figure and information credit: Blue Gene/Q compute chip. Ruud Haring. Hot Chips 23. August 2011. http://bit.ly/QWq1ID

Emerging Tiled Architectures

• Trends
  —more processor cores
  —larger cache sizes
  —deeper cache hierarchies
• Implications
  —wire delay of tens of clock cycles across chip
  —worst case latency: likely unacceptable hit times
• Tiled chip multiprocessors approach
  —co-locate part of shared cache near each core
    – reduce access latency to (at least some) shared data

Tiled Chip Multiprocessors

Advantages

• Simpler replicated physical design
  —readily scale to larger processor counts
• Can support product families with different number of tiles

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Alternatives for Managing Tiled L2 in CMPs

Treat each slice as a private L2 cache per tile (L2P)

Manage all slices as a single large shared L2 cache (L2S)

Figure credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

Implications of L2 Caching Strategy

• Manage each slice as a private L2 cache per tile
  • use directory approach to keep caches coherent
  • tags duplicated and distributed across tiles by set index
  + delivers lowest hit latency
    • works well when your working set fits in your private L2
  – reduces total effective cache capacity
    • each tile has a local copy of each line it touches
    • can’t borrow L2 space from other PEs with less full caches
• Manage all slices as a single large shared L2 cache
  • focus: NUCA (non-uniform cache architecture) designs
  • differs from dancehall design in Niagara and Blue Gene/Q
  + shared L2 increases effective cache capacity for shared data
  – incurs long hit latencies when L2 data is on remote tile
  – migration-based NUCA protocols seem problematic
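A back-of-the-envelope calculation shows why remote hits dominate in the shared scheme. The sketch assumes a 4x2 mesh of tiles, home tiles spread uniformly over the chip, and Manhattan-distance routing; these are assumptions for illustration, and no cycle counts are implied.

```python
from itertools import product


def mean_hops(cols=4, rows=2):
    """Average Manhattan distance between a requesting tile and a home tile,
    assuming homes are spread uniformly over a cols x rows mesh."""
    tiles = list(product(range(cols), range(rows)))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay), (bx, by) in product(tiles, repeat=2))
    return total / len(tiles) ** 2


average = mean_hops()           # mean hop count for the assumed 4x2 grid
local_fraction = 1 / (4 * 2)    # share of lines whose home is the local tile
print(average, local_fraction)
```

Under these assumptions only one line in eight has its home on the requesting tile, so most shared-L2 hits pay for at least one network hop.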


Victim Replication

• Combines advantages of private and shared L2$ schemes

• Variant of shared scheme

• Attempts to keep copies of local L1$ victims in local L2$
  —retained victim is a replica of one in an L2 on remote home tile

Victim Replication in Action

Dynamically build a small victim cache in L2

• Processor misses in shared L2
  —bring line from memory
  —place in L2 in a home tile determined by subset of address bits
  —also bring into L1 of requester
• Incoming invalidation to a processor
  —follow usual L2S protocol (check local L1 and L2)
• If L1 line is evicted on conflict or capacity miss
  —attempt to copy victim line into local L2
• Primary cache misses must check for local replica
  —no replica: forward request to home tile
  —on replica hit: invalidate replica in local L2, move to local L1
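The steps above can be sketched for a single tile. The constants `NUM_TILES` and `BLOCK_BYTES`, the choice of low block-number bits to select the home tile, and the `Tile` class are assumptions for illustration. Off-chip misses, invalidations, and the replacement decision are not modeled.

```python
NUM_TILES = 8       # assumed tile count (the evaluated CMP is a 4x2 grid)
BLOCK_BYTES = 64    # assumed cache block size


def home_tile(address):
    """Home tile chosen by a subset of address bits (low block-number bits)."""
    return (address // BLOCK_BYTES) % NUM_TILES


class Tile:
    def __init__(self, tile_id):
        self.tile_id = tile_id
        self.l1 = set()        # block addresses currently in the local L1
        self.replicas = set()  # L1 victims replicated in the local L2 slice

    def l1_miss(self, address):
        """Return a label for where the L1 miss is served from."""
        if home_tile(address) == self.tile_id:
            source = "local home L2"
        elif address in self.replicas:
            self.replicas.remove(address)   # replica hit: invalidate it in L2
            source = "local replica"
        else:
            source = f"remote home tile {home_tile(address)}"
        self.l1.add(address)                # line moves into the local L1
        return source

    def l1_evict(self, address, room_for_replica=True):
        """Evict a line from L1 and try to keep a replica in the local L2."""
        self.l1.discard(address)
        if home_tile(address) != self.tile_id and room_for_replica:
            self.replicas.add(address)


tile = Tile(0)
remote_line = 3 * BLOCK_BYTES               # home is tile 3
first = tile.l1_miss(remote_line)           # served by the remote home tile
tile.l1_evict(remote_line)                  # victim is replicated locally
second = tile.l1_miss(remote_line)          # served by the local replica
local = tile.l1_miss(8 * BLOCK_BYTES)       # home is tile 0 itself
print(first, second, local, sep=" | ")
```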

Victim Replacement Policy

• Never evict global shared line in favor of local replica
• L2VR replaces lines in following priority order
  —invalid line
  —global line with no sharers
  —existing replica
• If no lines belong to these categories
  —no replica is made in local L2 cache
  —victim evicted from the tile as in L2S
• More than one candidate line?
  —pick at random
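The priority order can be sketched as a selection over the lines of one L2 set. The dictionary encoding of a line (a `kind` field, plus a `sharers` count for global lines) and the function name `choose_replica_slot` are assumptions for this example.

```python
import random


def choose_replica_slot(lines, rng=random):
    """Pick the line in one L2 set to overwrite with a new replica.

    lines: list of dicts with 'kind' in {'invalid', 'global', 'replica'};
    global lines also carry a 'sharers' count. Returns an index, or None
    when no replica should be made.
    """
    priority_classes = (
        lambda line: line["kind"] == "invalid",
        lambda line: line["kind"] == "global" and line["sharers"] == 0,
        lambda line: line["kind"] == "replica",
    )
    for matches in priority_classes:
        candidates = [i for i, line in enumerate(lines) if matches(line)]
        if candidates:
            return rng.choice(candidates)   # several candidates: pick at random
    return None   # only actively shared global lines: make no replica


unshared_global = choose_replica_slot([
    {"kind": "global", "sharers": 2},
    {"kind": "replica"},
    {"kind": "global", "sharers": 0},
])
all_shared = choose_replica_slot([
    {"kind": "global", "sharers": 1},
    {"kind": "global", "sharers": 3},
])
print(unshared_global, all_shared)
```

In the first set the global line with no sharers is chosen ahead of the existing replica; in the second set every line is an actively shared global line, so no replica is made.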


Advantages of Victim Replication

• Hits to replicated copies reduce effective latency of shared L2

• Higher effective capacity for shared data than private L2


Victim Replication Evaluation Parameters

Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.

8-way CMP: 4x2 grid

• Associativity is 2x #PE
• Problematic for large tiled CMP?

VR Single-threaded Benchmarks

Table credit: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, M. Zhang and K. Asanovic. ISCA, June 2005.


Single Threaded Access Latencies


• L2VR adapts to provide 3-level hierarchy: L1, local L2, remote L2

• L2S latency is higher than competitors for single-threaded programs

• L2VR latency is close to L2P

Lower is better!


Single Threaded Off-chip Miss Rate


• L2VR has lower miss rates than L2P
• L2VR miss rates are slightly higher than L2S

Lower is better!

Single Threaded On-chip Coherence Traffic


• 71% fewer coherence msg hops using L2VR than L2S

• L2VR comparable to L2P

Lower is better!


VR Multithreaded Benchmarks



Multi-Threaded Average Access Latency

• IS: fits in L1
• BT, FT, LU, SP, apache: fit in L2 slice
• MG, EP, checkers: better with L2VR
• L2 slice = 1MB
• CG: almost fits in private L2 cache; low latency of L2P helps
  —high (9%) L1 miss rate

Lower is better!

Multi-Threaded Off-chip Miss Rates

• MG and EP improve with L2VR: they have fewer off-chip misses than L2P
• CG almost fits in L2 cache; L1 miss latency of L2P dominates cost of off-chip traffic
• dbench: high miss rates regardless

Lower is better!

Multi-Threaded On-chip Coherence Traffic

L2VR has less coherence traffic than L2S

Lower is better!

MT: Avg % L2$ as Replica Over Time


L2VR is adaptive: the fraction of L2 used for replicas differs across applications and over time

MT Memory Access Breakdown (L2P, L2S, L2VR)


Ideally: low # of misses, most hits in local L2

Victim Replication Summary

• Distributed shared L2 caches decrease off-chip traffic vs. private caches at the expense of latency

• Victim replication reduces on-chip latency by replicating cache lines within the same level of cache, near threads that are actively accessing the line
• Result: dynamically self-tuning hybrid between private and shared caches
• Multithreaded benchmark results summary
  —in most cases, L2VR creates enough replicas that performance is within 5% of L2P
  —L2VR reduces memory latency by an average of 16% compared to L2S
  —CG is the only case where L2P significantly outperforms both L2S and L2VR (almost fits in private L2)

Additional References

• Victim Caching
  —Jouppi, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News 18, 3a (Jun. 1990), 364-373. DOI= http://doi.acm.org/10.1145/325096.325162
• Shared Caches in Multicores: The Good, The Bad, and The Ugly. Mary Jane Irwin. Athena Award Lecture. International Symposium on Computer Architecture, Saint-Malo, France, June 2010.
