
Lecture 10: Cache Coherence — ida.liu.seTDTS08/lectures/17/lec10.pdf · Zebo Peng, IDA, LiTH


2017-12-04


Zebo Peng, IDA, LiTH · TDTS 08 – Lecture 10

Lecture 10: Cache Coherence

Introduction

Snoopy protocols

Directory protocols

L1-L2 consistency


Review: What is a Cache?

Small, fast memory used to improve memory system performance. It exploits the spatial and temporal locality of typical programs. Conceptually, we have many caches:

Registers ─ a cache for first-level cache; First-level cache ─ a cache for second-level cache; Second-level cache ─ a cache for main memory; and Main memory ─ a cache for hard disk (virtual memory).

[Figure: the memory hierarchy — Processor → Regs → L1-Cache → L2-Cache → Main Memory → Disk, Tape, etc.; levels get bigger toward the bottom and faster toward the top.]


What Happens with a Multiprocessor?

Different processors may access values at the same memory location, so there can be multiple copies of the same data in different caches.

How do we ensure data integrity at all times? An update by a processor at time t should be available to the other processors at time t+1.

I/O may also address main memory directly.

[Figure: two processors Pi and Pj, each with a cache ($), sharing main memory M; an I/O device also accesses M directly.]

In particular, WRITE operations must be carefully coordinated.


Write-Through Policy

All writes go to main memory as well as cache.

Cache and memory are consistent.

A write slows down to main-memory access speed.

OK, since the write percentage is small (ca. 15%).

For multiprocessor:

There is additional inconsistency due to other cache copies of the same memory location.

The processors must therefore monitor main memory traffic to keep local cache up to date.

May lead to a lot of traffic and monitoring activities.


Write-Back Policy

Updates initially made in cache only.

Cache and memory are not consistent by nature.

Update bit for cache slot is set when update occurs.

If a block is to be replaced, write it back to the main memory if its update bit is set.

For multiprocessor:

Data in the other caches will also be inconsistent.

Mechanism must be used to maintain cache coherence.

I/O must access main memory also through cache.
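To make the contrast between the two policies concrete, here is a minimal sketch of a toy single-line cache (an invented illustration; the class and names are not from the lecture):

```python
# Toy model of write-through vs. write-back (illustrative only).

class ToyCache:
    def __init__(self, memory, write_through):
        self.memory = memory            # backing store: addr -> value
        self.data = {}                  # cached copies: addr -> value
        self.dirty = set()              # update bits (write-back only)
        self.write_through = write_through

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_through:
            self.memory[addr] = value   # memory updated on every write
        else:
            self.dirty.add(addr)        # defer: set the update bit

    def evict(self, addr):
        if addr in self.dirty:          # write back only if the update bit is set
            self.memory[addr] = self.data[addr]
            self.dirty.discard(addr)
        self.data.pop(addr, None)

mem = {0x10: 0}
wb = ToyCache(mem, write_through=False)
wb.write(0x10, 42)
print(mem[0x10])   # 0: memory is stale until eviction
wb.evict(0x10)
print(mem[0x10])   # 42: written back because the update bit was set
```

Under write-through the second print would already show 42 before eviction, which is exactly why the memory copy stays consistent at the cost of slower writes.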


Software Solutions

Based on code analysis:

Determine which data items may become unsafe for caching, e.g., global variables used and updated by several processes.

Mark them so that they are not cached, which means that reading these data will also slow down to memory access speed.

One can also determine the unsafe periods and insert code to enforce cache coherence.

Compiler and OS deal with the problem.

Overhead is transferred to compile time.

Design complexity is transferred from hardware to software.

However, software tends to make conservative decisions:

Inefficient cache utilization.

Reduced performance of the memory system.


Hardware Solutions

Implement hardware cache coherence protocols.

Dynamic recognition of potential problems.

Run-time solution.

More efficient use of cache.

Transparent to programmers.

Two main techniques:

Directory protocols

Snoopy protocols

Lead to better performance than SW solutions.


Lecture 10: Cache Coherence

Introduction

Snoopy protocols

Directory protocols

L1-L2 consistency


Directory Protocols

A directory is used to collect and maintain information about copies of data in caches.

The directory is stored in the memory system.

Memory references are checked against the directory.

Appropriate transfers are performed.

Memory access may be avoided by copying from another cache rather than fetching the data from the memory.

It creates a central bottleneck: the directory. This can be partially alleviated by having multiple directories.

Effective in large-scale systems with complex interconnection schemes, such as NUMA.


Directory Protocol Example

Cache Coherence Directory

Each node maintains a directory for a portion of memory and the corresponding cache status.

NUMA Architecture


Memory Access Sequence

Processor 3 of Node 2 (P2-3) requests location 798, which is located in the memory of node 1:

P2-3 issues read request on snoopy bus of node 2.

Directory on node 2 recognizes location is on node 1.

Node 2 directory requests node 1’s directory.

Node 1 directory requests the cache line containing 798.

Node 1 memory puts data on (node 1 local) bus.

Node 1 directory gets data from (node 1 local) bus.

Data transferred to node 2’s directory.

Node 2 directory puts data on (node 2 local) bus.

Data picked up, put in P2-3’s caches and delivered to processor.


Cache Coherence Operations

Node 1 directory keeps note that node 2 has copy of data (of address 798).

If the data is modified in the cache, the change is broadcast to the other nodes.

Local directories will monitor and purge local caches if needed.

The local directory, which owns the address, will:

monitor changes in remote caches and mark the memory location as invalid until the data is written back;

force a write-back if the memory location is requested by another processor.
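The operations above can be sketched as a toy home-node directory (an invented model for illustration, not the slides' exact protocol):

```python
# Toy home-node directory: tracks sharers, marks memory invalid on a
# remote modification, and forces a write-back when another node reads.

class HomeDirectory:
    def __init__(self):
        self.sharers = set()      # nodes known to hold a copy of the line
        self.owner = None         # node holding a modified (dirty) copy
        self.memory_valid = True  # is the memory copy up to date?

    def read(self, node):
        if self.owner is not None and self.owner != node:
            self.force_write_back()        # fetch the dirty copy first
        self.sharers.add(node)

    def write(self, node):
        purged = [n for n in self.sharers if n != node]  # purge remote copies
        self.sharers = {node}
        self.owner = node
        self.memory_valid = False          # memory copy is now stale
        return purged

    def force_write_back(self):
        self.owner = None
        self.memory_valid = True           # memory is fresh again

d = HomeDirectory()
d.read(2)                 # node 2 caches the line
d.write(2)                # node 2 modifies it: memory marked invalid
print(d.memory_valid)     # False
d.read(3)                 # node 3 requests it: write-back is forced first
print(d.memory_valid)     # True
print(sorted(d.sharers))  # [2, 3]
```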


Lecture 10: Cache Coherence

Introduction

Snoopy protocols

Directory protocols

L1-L2 consistency


Snoopy Protocols

Distribute the cache coherence task among all cache controllers.

If the initial cache controller recognizes that a line is shared, the updates will be announced to all other caches.

Each cache controller “snoops” on the network to observe these broadcast notifications and reacts accordingly.

Ideally suited to a bus-based multiprocessor system: the bus provides a simple means of broadcasting and snooping. However, broadcasting and snooping increase bus traffic.

Two approaches: write invalidate and write update.

[Figure: processors P, each with a cache ($), connected over a shared bus to memory M.]


Write Invalidate SP

Suitable for multiple readers, but one writer at a time.

Generally, a line may be shared among several caches for reading purposes.

When one of the caches wants to write to the line, it first issues a notice to invalidate the line in all other caches.

The writing processor then has exclusive access until the line is required by another processor.

Used in Pentium and PowerPC systems.

The state of every cache line is marked as Modified, Exclusive, Shared or Invalid.

The MESI Protocol.


Snoopy Cache Organization

[Figure: several processors, each with its own cache and snoop hardware; all snoop units attach to a shared address/data bus that connects to memory.]


MESI Status

Modified: the line is modified and differs from memory.

Exclusive: same as in memory and not present in any other cache.

Shared: same as in memory and present in other caches.

Invalid: the line's contents are not valid.
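As a rough illustration, the processor-side MESI transitions can be written as a lookup table (a simplified sketch; the event names are invented here, and the bus-triggered snoop transitions are omitted):

```python
# Simplified processor-side MESI transitions (illustrative sketch).

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def next_state(state, event):
    """Next MESI state for a processor-side event (common cases only)."""
    table = {
        (INVALID,   "read_miss_exclusive"): EXCLUSIVE,  # no other cache holds it
        (INVALID,   "read_miss_shared"):    SHARED,     # other caches hold it
        (INVALID,   "write_miss"):          MODIFIED,   # read-with-intent-to-modify
        (EXCLUSIVE, "read_hit"):            EXCLUSIVE,
        (EXCLUSIVE, "write_hit"):           MODIFIED,   # no bus traffic needed
        (SHARED,    "read_hit"):            SHARED,
        (SHARED,    "write_hit"):           MODIFIED,   # invalidate other copies
        (MODIFIED,  "read_hit"):            MODIFIED,
        (MODIFIED,  "write_hit"):           MODIFIED,
    }
    return table[(state, event)]

s = INVALID
s = next_state(s, "read_miss_shared")  # line filled, shared with others
s = next_state(s, "write_hit")         # invalidate broadcast, now Modified
print(s)   # M
```

Note how a write hit on a Shared line is the case that triggers the invalidate notice to the other caches, while a write hit on an Exclusive line needs no bus traffic at all.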



MESI State Transition Diagram

[Figure: two state transition diagrams over the states Invalid, Shared, Exclusive, and Modified. Left: the cache line at the initiating processor (next event triggered by the processor), with transitions labelled RH (read hit), RMS (read miss), WH (write hit), and WM (write miss). Right: the line in the snooping cache(s) (next event triggered by the bus), with transitions labelled SHR and SHW (snoop hits on a read/write). Bus actions include fill, copyback, invalidate, and read-with-intent-to-modify. Note: a cache line has several words!]


Write Update SP

Works well with multiple readers and writers.

The updated word is distributed to all other processors.


It may generate many unnecessary updates:

if a processor just reads a value once and does not need it again; or

if a processor updates a value many times before it is read by the other processors (bad programming).


Invalidate vs. Update Protocols

An update protocol may generate many unnecessary cache updates.

However, if two processors make interleaved reads and updates to a variable, an update protocol is better; an invalidate protocol may then lead to many memory accesses.

Both protocols suffer from false-sharing overheads: two words are not actually shared, but they lie on the same cache line.

Most modern machines use invalidate protocols, since the usual situation is one writer with many readers.
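A toy message-count model (invented here for illustration) makes the trade-off concrete for the two access patterns discussed above:

```python
# Counting coherence messages under update vs. invalidate (toy model,
# one shared variable; a trace is a list of (processor, "r"/"w") pairs).

def update_messages(trace):
    """Update protocol: every write is broadcast to the other caches."""
    return sum(1 for who, op in trace if op == "w")

def invalidate_messages(trace):
    """Invalidate protocol: a write by a new writer broadcasts an
    invalidation; the next foreign read then misses and refetches."""
    msgs, last_writer = 0, None
    for who, op in trace:
        if op == "w":
            if last_writer != who:
                msgs += 1          # invalidate broadcast
            last_writer = who
        elif last_writer is not None and last_writer != who:
            msgs += 1              # read miss: refetch after invalidation
            last_writer = None
    return msgs

# One writer updating many times before anyone reads: invalidate wins.
burst = [("P0", "w")] * 10 + [("P1", "r")]
print(update_messages(burst), invalidate_messages(burst))        # 10 2

# Interleaved reads and writes: update wins.
pingpong = [("P0", "w"), ("P1", "r")] * 5
print(update_messages(pingpong), invalidate_messages(pingpong))  # 5 10
```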



Directory vs. Snoopy Schemes

Snoopy caches:

Each coherence operation is sent to all other processors.

It generates large traffic, which is an inherent limitation.

Easy to implement on a bus-based system.

Not feasible for machines with memory distributed across a large number of sub-systems (e.g., NUMA).

Directory caches:

The need for a broadcast medium is replaced by a directory.

The additional information stored in the directory may add significant overhead.

The underlying network must also be able to carry out all the coherence requests.

The directory becomes a point of contention, therefore, distributed directory schemes (with many directories) are often used.


Lecture 10: Cache Coherence

Introduction

Snoopy protocols

Directory protocols

L1-L2 consistency


L1-L2 Cache Consistency

Cache coherence techniques apply only to caches connected to a bus or other interconnection mechanism, typically L2 caches.

However, a processor often has an L1 cache that is not connected to a bus, so no snoopy protocol can be used there.


L1-L2 Cache Consistency Solution

To extend cache coherence protocols to L1 caches:

Each L1 line should keep track of the state of the corresponding L2 line, and L1 should write through to L2.

This requires:

L1 must be a subset of L2.

The associativity of the L2 cache should be equal to or greater than that of the L1 cache.

Ex.: if L2 is 2-way set-associative while L1 is 4-way set-associative, it does not work.

If L1 has a write-back policy, the interaction between L1 and L2 will be more complex.
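The associativity requirement can be illustrated with a toy LRU set (an invented model): four blocks that conflict in one set all fit in a 4-way L1, but a 2-way L2 can keep only two of them, so L1 stops being a subset of L2 and snooping on L2 would miss lines L1 still holds.

```python
# Toy LRU cache set: why L2 associativity must be >= L1's for inclusion.
from collections import OrderedDict

class LruSet:
    def __init__(self, ways):
        self.ways, self.lines = ways, OrderedDict()

    def access(self, tag):
        self.lines[tag] = True
        self.lines.move_to_end(tag)        # mark as most recently used
        if len(self.lines) > self.ways:
            self.lines.popitem(last=False) # evict the least recently used

l1, l2 = LruSet(4), LruSet(2)
for tag in ["A", "B", "C", "D"]:  # four blocks mapping to the same set
    l1.access(tag)
    l2.access(tag)

print(set(l1.lines))                   # {'A', 'B', 'C', 'D'}
print(set(l2.lines))                   # {'C', 'D'}: L2 had to evict A and B
print(set(l1.lines) <= set(l2.lines))  # False: inclusion (L1 ⊆ L2) is violated
```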



Inclusive vs. Exclusive Caches

Inclusive caches:

Each block existing in the first level also exists in the next level.

When fetching a memory block, place it in all cache levels.

- Duplicating data in several caches ─ less efficient.

- Maintaining inclusion by forced evictions.

+ Making cache coherence easier, since it needs to track other processors’ accesses only in the highest-level cache.

Exclusive caches:

The blocks in different cache levels are mutually exclusive.

+ More efficient utilization of cache space.

- More levels to keep track of to ensure cache coherence.

Non-inclusive caches: No guarantee for inclusion or exclusion.

Simpler design, ex. most Intel processors: Pentium II, III, 4.


An Example: Alpha-Server 4100

Four-processor shared-memory symmetric multi-processor system.

Each processor has a three-level cache hierarchy:

L1 consists of two direct-mapped on-chip caches, one for instruction and one for data (Harvard architecture).

• Write-through to L2 with a write buffer.

L2 is an on-chip 3-way set associative cache with write-back to L3.

L3 is an off-chip direct-mapped cache with write-back to main memory.

NOTE: L3’s associativity level is lower than L2’s, which is not a problem ─ non-inclusive cache organization.


Summary

Cache coherence in multiprocessor systems is an important issue to be considered.

Otherwise, performance will suffer.

Additional hardware is required to coordinate access to data that might have multiple cache copies.

The underlying technique must provide guarantees on the correct semantics.

Both hardware and software solutions can be used.

There are several different protocols to be selected for the hardware solutions.