Lecture 2. Snoop-based Cache Coherence Protocols

Prof. Taeweon Suh

Computer Science Education

Korea University

COM503 Parallel Computer Architecture & Programming

Korea Univ

Flynn’s Taxonomy

• A classification of computers, proposed by Michael J. Flynn in 1966
  - Characterizes computer designs in terms of the number of distinct instructions issued at a time and the number of data elements they operate on

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

Source: Wikipedia

Flynn’s Taxonomy (Cont.)

• SISD: Single Instruction, Single Data
  - Uniprocessor
  - Example: your desktop (notebook) computer before the spread of CPUs with two or more cores
• SIMD: Single Instruction, Multiple Data
  - Each processor works on its own data stream
  - But all processors execute the same instruction in lockstep
  - Example: MMX and GPU

Picture sources: Wikipedia

SIMD Example

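• MMX (Multimedia Extension)
  - 64-bit registers == 2 32-bit integers, 4 16-bit integers, or 8 8-bit integers processed concurrently
• SSE (Streaming SIMD Extensions)
  - 128-bit registers == 4 single-precision floating-point operations (2 double-precision with SSE2)
  - 256-bit registers == 4 DP floating-point operations arrive with the later AVX extension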
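The packed-integer idea behind MMX can be sketched in plain Python. This is only an illustration of lane-wise arithmetic, not how the hardware is programmed: the function name `packed_add_u8` is hypothetical, and wraparound on overflow is assumed for each lane.

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add two 64-bit words as eight independent unsigned 8-bit lanes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Each lane wraps around on overflow; no carry leaks into the next lane.
        result |= ((x + y) & 0xFF) << shift
    return result


a = 0x01020304050607FF
b = 0x0101010101010101
print(f"{packed_add_u8(a, b):016x}")  # lowest lane: 0xFF + 0x01 wraps to 0x00
```

A SIMD unit performs all eight lane additions with one instruction, whereas the loop above performs them one at a time.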

Flynn’s Taxonomy (Cont.)

• MISD: Multiple Instruction, Single Data
  - Each processor executes different instructions on the same data
  - Not used much
• MIMD: Multiple Instruction, Multiple Data
  - Each processor executes its own instruction on its own data
  - Virtually all multiprocessor systems are based on MIMD

Picture sources: Wikipedia

Multiprocessor Systems

• Shared memory systems
  - Bus-based shared memory
  - Distributed shared memory
    • Current server systems (for example, Xeon-based servers)
• Cluster-based systems
  - Supercomputers and datacenters

Clusters

Supercomputer dubbed 7N (cluster computer), 95th fastest in the world on the TOP500 in 2007

Sources:
http://www.tik.ee.ethz.ch/~ddosvax/cluster/
https://www.jlab.org/news/releases/jefferson-lab-boasts-virginias-fastest-computer

Shared Memory Multiprocessor Models

[Figure: three shared-memory multiprocessor models, each built from processors (P), caches ($), and memory]
• Bus-based shared memory
• Fully-connected shared memory (Dancehall), using an interconnection network
• Distributed shared memory, using an interconnection network

Our focus today

Some Terminologies

• Shared memory systems can be classified into
  - UMA (Uniform Memory Access) architecture
  - NUMA (Non-Uniform Memory Access) architecture
• SMP (Symmetric Multiprocessor) is a UMA example
  - Not to be confused with SMT (Simultaneous Multithreading)

SMP (UMA) Systems

[Figure: two SMP examples, an antique (?) P-III based SMP and a Sandy Bridge based motherboard, with a block diagram of two P-III processors, each with a cache ($), sharing one memory]

Sources:
http://www.evga.com/forums/tm.aspx?m=1897631&mpage=1
http://news.softpedia.com/newsImage/Gigabyte-Also-Details-Its-Sandy-Bridge-Motherboard-Replacement-Program-2.jpg/

DSM (NUMA) Machine Examples

• Nehalem-based systems with QPI

[Figure: Nehalem-based Xeon 5500 system]

QPI: QuickPath Interconnect

Source: http://www.qdpma.com/systemarchitecture/SystemArchitecture_QPI.html

More Recent NUMA System

Sources:
http://ark.intel.com/products/64596/Intel-Xeon-Processor-E5-2690-20M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
http://www.intel.in/content/www/in/en/intelligent-systems/crystal-forest-server/xeon-e5-2600-e5-2400-89xx-ibd.html
http://www.anandtech.com/show/6533/gigabyte-ga7pesh1-review-a-dual-processor-motherboard-through-a-scientists-eyes

Amdahl’s Law (Law of Diminishing Returns)

• Amdahl’s law is named after computer architect Gene Amdahl

• It is used to find the maximum expected improvement to an overall system

• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program

$$\text{Maximum speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$$

• P: parallelizable portion of a program
• N: number of processors

Source: Wikipedia
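Amdahl's law above can be evaluated with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` and the sample values of P and N are chosen here for illustration only.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Maximum speedup for parallelizable fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


for n in (2, 10, 100, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 3))
```

With P = 0.9 the speedup approaches, but never reaches, 1 / (1 - P) = 10 no matter how many processors are added, which is why the law is also called the law of diminishing returns.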

WB & WT Caches

[Figure: a CPU core writes X = 300 while its cache and memory both hold X = 100. Writeback: only the cached copy becomes X = 300; memory still holds X = 100. Writethrough: both the cached copy and memory become X = 300.]
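The difference between the two write policies can be sketched with a toy single-cache model. The class `Cache` and its methods are hypothetical names for this illustration; a real cache works on fixed-size lines and sets rather than a dictionary of named locations.

```python
class Cache:
    """Toy cache in front of a memory dictionary (assumed model, one value per address)."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}     # addr -> cached value
        self.dirty = set()  # addresses modified but not yet written back

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value   # memory is updated on every write
        else:
            self.dirty.add(addr)        # memory is updated only on eviction

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


wb_memory, wt_memory = {"X": 100}, {"X": 100}
wb, wt = Cache(wb_memory, write_through=False), Cache(wt_memory, write_through=True)
for cache in (wb, wt):
    cache.read("X")
    cache.write("X", 300)
print(wb_memory["X"], wt_memory["X"])  # writeback memory is stale, writethrough is not
wb.evict("X")
print(wb_memory["X"])
```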

Definition of Coherence

• Coherence is a property of a shared-memory architecture giving the illusion to the software that there is a single copy of every memory location, even if multiple copies exist

• A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

Modified Slide from Prof. H.H. Lee in Georgia Tech

[Figure: two P-III processors, each with a cache ($), sharing one memory]

Definition of Coherence

• A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

• Implicit definition of coherence
  - Write propagation
    • Writes are visible to other processes
  - Write serialization
    • All writes to the same location are seen in the same order by all processes

Slide from Prof. H.H. Lee in Georgia Tech

Why Cache Coherency?

• The closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates (writes) lead to an incoherent state
  - The problem arises in both write-through and writeback caches


[Figure: Core i7 cache hierarchy. Each CPU core has a register file, an L1 I$ (32KB), an L1 D$ (32KB), and an L2 cache (256KB); the L3 cache (8MB) is shared.]

Writeback Cache w/o Coherence

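[Figure: three processors with writeback caches over a shared memory holding X = 100. One processor writes X, changing its cached copy from 100 to 505; memory and another cached copy still hold X = 100, so it is unclear what the other processors' reads of X return.]

Slide from Prof. H.H. Lee in Georgia Tech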

Writethrough Cache w/o Coherence

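[Figure: the same three processors with writethrough caches. The write changes the writer's cached copy and memory to X = 505, but another cache still holds X = 100, so it is unclear what a read of X by that processor returns.]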
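The stale-read problem in the two slides above can be reproduced with a toy model that has private caches and no coherence protocol. The function `run` and the dictionary-based caches are assumptions made for this sketch.

```python
def run(write_through: bool):
    """Return (value P2 reads, value in memory) after P1 writes X = 505."""
    memory = {"X": 100}
    cache_p1, cache_p2 = {}, {}
    # Both processors read X, so each private cache holds X = 100.
    cache_p1["X"] = memory["X"]
    cache_p2["X"] = memory["X"]
    # P1 writes X = 505 into its own cache.
    cache_p1["X"] = 505
    if write_through:
        memory["X"] = 505
    # P2 reads X and hits in its own cache; nothing told it about the write.
    return cache_p2["X"], memory["X"]


print("writeback:   ", run(write_through=False))
print("writethrough:", run(write_through=True))
```

In both cases P2 reads the old value 100. Writethrough fixes memory but not the other cached copy, which is why a snooping protocol is needed for either policy.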

Cache Coherence Protocols According to Caching Policies

• Write-through cache
  - Update-based protocol
  - Invalidation-based protocol
• Writeback cache
  - Update-based protocol
  - Invalidation-based protocol

Bus Snooping based on Write-Through Cache

• Every write appears as a transaction on the shared bus to memory
• Two protocols
  - Update-based protocol
  - Invalidation-based protocol

Bus Snooping

• Update-based Protocol on Write-Through cache

[Figure: a processor writes X = 505. The write appears as a bus transaction that updates memory; a cache holding X = 100 snoops the bus and updates its copy to X = 505.]

Bus Snooping

• Invalidation-based Protocol on Write-Through cache

[Figure: a processor writes X = 505. The write appears as a bus transaction that updates memory; a cache holding X = 100 snoops the bus and invalidates its copy. A later Load X by that processor obtains X = 505.]

A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache

State transitions (Observed / Transaction):

Current state   Observed   Transaction   Next state   Initiated by
Invalid         PrRd       BusRd         Valid        Processor
Invalid         PrWr       BusWr         Invalid      Processor
Valid           PrRd       ---           Valid        Processor
Valid           PrWr       BusWr         Valid        Processor
Valid           BusWr      ---           Invalid      Bus snooper
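The two-state protocol above can be sketched as a small simulation. The classes `Bus` and `WTCache` and their method names are hypothetical; the sketch assumes one value per address and a bus that delivers every BusWr to all other caches.

```python
class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []

    def bus_rd(self, addr):
        self.log.append(("BusRd", addr))
        return self.memory[addr]

    def bus_wr(self, writer, addr, value):
        self.log.append(("BusWr", addr))
        self.memory[addr] = value              # writethrough: memory always updated
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_bus_wr(addr)


class WTCache:
    """Writethrough, no-write-allocate cache; an address present in lines is Valid."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def pr_rd(self, addr):
        if addr not in self.lines:             # Invalid: PrRd / BusRd -> Valid
            self.lines[addr] = self.bus.bus_rd(addr)
        return self.lines[addr]                # Valid: PrRd / ---

    def pr_wr(self, addr, value):
        if addr in self.lines:                 # Valid: update the local copy
            self.lines[addr] = value
        self.bus.bus_wr(self, addr, value)     # PrWr / BusWr; a miss does not allocate

    def snoop_bus_wr(self, addr):
        self.lines.pop(addr, None)             # Valid: BusWr / --- -> Invalid


bus = Bus({"X": 100})
p1, p2 = WTCache(bus), WTCache(bus)
p1.pr_rd("X")
p2.pr_rd("X")
p1.pr_wr("X", 505)
value = p2.pr_rd("X")   # P2's copy was invalidated, so this read misses
print(value, bus.log)
```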

How about Writeback Cache?

• WB cache to reduce bandwidth requirement

• The majority of local writes are hidden behind the processor nodes

• How to snoop?

• Write Ordering



Cache Coherence Protocols for WB Caches

• A cache has an exclusive copy of a line if
  - It is the only cache having a valid copy
  - Memory may or may not have it
• Modified (dirty) cache line
  - The cache having the line is the owner of the line, because it must supply the block

Update-based Protocol on WB Cache

[Figure: three caches hold X = 100. One processor executes Store X with X = 505; a bus transaction carries the update, and the other two copies become X = 505.]

• Update data for all processor nodes that share the same data
• Because a processor node keeps updating the memory location, a lot of traffic will be incurred

Slide from Prof. H.H. Lee in Georgia Tech

Update-based Protocol on WB Cache

[Figure: continuing the example, all three caches hold X = 505 and a Load X by a sharer hits. Another Store X with X = 333 triggers another bus update, and the other two copies become X = 333.]

Slide from Prof. H.H. Lee in Georgia Tech

Invalidation-based Protocol on WB Cache

• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

[Figure: three caches hold X = 100. One processor executes Store X with X = 505; the bus transaction invalidates the other two copies.]


Invalidation-based Protocol on WB Cache (Cont.)

[Figure: another processor executes Load X and misses. Its bus transaction is snooped, and the snoop hits in the cache holding X = 505, which supplies the data.]

Slide from Prof. H.H. Lee in Georgia Tech


Invalidation-based Protocol on WB Cache (Cont.)

[Figure: successive Store X operations change X from 505 to 333, 987, and 444, with bus snooping between the caches]

MSI Writeback Invalidation Protocol

• Modified
  - Dirty
  - Only this cache has a valid copy
• Shared
  - Memory is consistent
  - One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before the memory is updated


• Two types of requests from the processor
  - PrRd
  - PrWr
• Three types of bus transactions posted by the cache controller
  - BusRd
    • PrRd misses the cache
    • Memory or another cache supplies the line
  - BusRdX (Read-to-own)
    • PrWr is issued to a line which is not in the Modified state
  - BusWB
    • Writeback due to replacement
    • The processor is not directly involved in initiating this operation

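MSI Writeback Invalidation Protocol (Processor Request)

Processor-initiated transitions:

Current state   Observed   Transaction   Next state
Invalid         PrRd       BusRd         Shared
Invalid         PrWr       BusRdX        Modified
Shared          PrRd       ---           Shared
Shared          PrWr       BusRdX        Modified
Modified        PrRd       ---           Modified
Modified        PrWr       ---           Modified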

MSI Writeback Invalidation Protocol (Bus Transaction)

Bus-snooper-initiated transitions:

Current state   Observed   Action   Next state
Modified        BusRd      Flush    Shared
Modified        BusRdX     Flush    Invalid
Shared          BusRd      ---      Shared
Shared          BusRdX     ---      Invalid

• Flush data on the bus
• Both memory and the requestor will grab the copy
• The requestor gets data from either
  - Cache-to-cache transfer; or
  - Memory

MSI Writeback Invalidation Protocol (Bus Transaction): Another Possible Implementation

[State diagram: the MSI bus-snooper-initiated transitions, except that a Modified line observing BusRd flushes and moves to Invalid instead of Shared]

• Anticipate no more reads from this processor
• A performance concern
• Save the “invalidation” trip if the requesting cache writes the shared line later


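[State diagram: the complete MSI protocol over the Modified, Shared, and Invalid states, combining the processor-initiated transitions (PrRd / BusRd, PrRd / ---, PrWr / BusRdX, PrWr / ---) with the bus-snooper-initiated transitions (BusRd / ---, BusRd / Flush)]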

MSI Example

[Figure: P1, P2, and P3, each with a cache, share a bus to memory. Memory initially holds X = 10.]

Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
P1 reads X         S             ---           ---           BusRd             Memory
P3 reads X         S             ---           S             BusRd             Memory
P3 writes X        I             ---           M             BusRdX
P1 reads X         S             ---           S             BusRd             P3 Cache
P2 reads X         S             S             S             BusRd             Memory

• After P3 writes X, P3's cache holds X = -25 in the M state while memory still holds X = 10
• When P1 reads X, P3 supplies the line, and both P1's cache and memory receive X = -25
• P2 then reads X = -25 from memory

Slide from Prof. H.H. Lee in Georgia Tech
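The MSI example can be replayed with a small simulator. This is a sketch under simplifying assumptions: one value per address, an atomic bus, no replacements (so no BusWB), and hypothetical names (`MSISystem`, `state_of`, `states`). A cache that has never loaded the line is reported as I rather than ---.

```python
class MSISystem:
    """MSI for a few caches on one bus; a missing entry means Invalid."""

    def __init__(self, num_caches, memory):
        self.memory = dict(memory)
        self.state = [{} for _ in range(num_caches)]  # addr -> "M" or "S"
        self.data = [{} for _ in range(num_caches)]

    def state_of(self, p, addr):
        return self.state[p].get(addr, "I")

    def read(self, p, addr):
        """PrRd: returns (bus transaction, data supplier)."""
        if self.state_of(p, addr) != "I":
            return None, None                       # PrRd / --- (hit in S or M)
        supplier = "Memory"
        for q in range(len(self.state)):
            if q != p and self.state_of(q, addr) == "M":
                # Snooper in M observes BusRd: flush, memory grabs the copy, M -> S.
                self.memory[addr] = self.data[q][addr]
                self.state[q][addr] = "S"
                supplier = f"P{q + 1} Cache"
        self.data[p][addr] = self.memory[addr]
        self.state[p][addr] = "S"                   # I -> S
        return "BusRd", supplier

    def write(self, p, addr, value):
        """PrWr: returns the bus transaction, if any."""
        txn = None
        if self.state_of(p, addr) != "M":
            txn = "BusRdX"
            for q in range(len(self.state)):
                if q == p:
                    continue
                if self.state_of(q, addr) == "M":
                    self.memory[addr] = self.data[q][addr]  # BusRdX / Flush
                self.state[q].pop(addr, None)               # any copy -> Invalid
                self.data[q].pop(addr, None)
        self.state[p][addr] = "M"
        self.data[p][addr] = value
        return txn

    def states(self, addr):
        return "".join(self.state_of(p, addr) for p in range(len(self.state)))


msi = MSISystem(3, {"X": 10})
trace = [
    (msi.read(0, "X"), msi.states("X")),        # P1 reads X
    (msi.read(2, "X"), msi.states("X")),        # P3 reads X
    (msi.write(2, "X", -25), msi.states("X")),  # P3 writes X
    (msi.read(0, "X"), msi.states("X")),        # P1 reads X
    (msi.read(1, "X"), msi.states("X")),        # P2 reads X
]
for step in trace:
    print(step)
```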

MESI Writeback Invalidation Protocol

• To reduce two types of unnecessary bus transactions
  - BusRdX that snoops and converts the block from S to M when you are the sole owner of the block
  - BusRd that gets the line in the S state when there are no sharers (which leads to the overhead above)
• Introduce the Exclusive state
  - One can write to the copy without generating BusRdX
• Illinois Protocol: proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, MIPS

Slide from Prof. H.H. Lee in Georgia Tech

MESI Writeback Invalidation (Processor Request)

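Processor-initiated transitions (S: Shared Signal):

Current state   Observed   Transaction    Next state
Invalid         PrRd       BusRd(not-S)   Exclusive
Invalid         PrRd       BusRd(S)       Shared
Invalid         PrWr       BusRdX         Modified
Shared          PrRd       ---            Shared
Shared          PrWr       BusRdX         Modified
Exclusive       PrRd       ---            Exclusive
Exclusive       PrWr       ---            Modified
Modified        PrRd       ---            Modified
Modified        PrWr       ---            Modified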

MESI Writeback Invalidation Protocol (Bus Transactions)

Bus-snooper-initiated transitions:

Current state   Observed   Action           Next state
Modified        BusRd      Flush            Shared
Modified        BusRdX     Flush            Invalid
Exclusive       BusRd      Flush (or ---)   Shared
Exclusive       BusRdX     ---              Invalid
Shared          BusRd      Flush*           Shared
Shared          BusRdX     Flush*           Invalid

Flush*: Flush for data supplier; no action for other sharers

• Whenever possible, the Illinois protocol performs a $-to-$ transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update)

Modified Slide from Prof. H.H. Lee in Georgia Tech

MESI Writeback Invalidation Protocol (Illinois Protocol)

[State diagram: the complete Illinois (MESI) protocol over the Invalid, Exclusive, Modified, and Shared states, combining the processor-initiated transitions (PrRd / BusRd(not-S), PrRd / BusRd (S), PrRd / ---, PrWr / ---, PrWr / BusRdX) with the bus-snooper-initiated transitions (BusRd / Flush, BusRd / Flush (or ---), BusRd / Flush*, BusRdX / Flush, BusRdX / Flush*, BusRdX / ---)]

S: Shared Signal
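The benefit of the Exclusive state can be seen by counting bus transactions in a toy MESI model of a single cache line. The class `MESILine` is a hypothetical name, and data movement is left out; only states and bus transactions are tracked.

```python
class MESILine:
    """States of one cache line across several caches, plus a bus transaction log."""

    def __init__(self, num_caches):
        self.state = ["I"] * num_caches
        self.bus = []

    def read(self, p):
        if self.state[p] != "I":
            return                                  # PrRd / --- in S, E, or M
        # The Shared signal is asserted if any other cache holds the line.
        shared = any(s != "I" for q, s in enumerate(self.state) if q != p)
        self.bus.append("BusRd")
        for q, s in enumerate(self.state):
            if q != p and s in ("M", "E"):
                self.state[q] = "S"                 # the holder drops to Shared
        self.state[p] = "S" if shared else "E"

    def write(self, p):
        if self.state[p] in ("I", "S"):
            self.bus.append("BusRdX")               # invalidate all other copies
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = "I"
        self.state[p] = "M"                         # E -> M and M -> M are silent


line = MESILine(3)
line.read(0)            # no sharers: I -> E
line.write(0)           # E -> M without any bus transaction
print(line.state, line.bus)
line.read(1)            # M holder drops to S, requester gets S
line.write(1)           # S -> M needs BusRdX
print(line.state, line.bus)
```

Under MSI the first read would have left the line in S, so the following write would have cost an extra BusRdX; with the E state the read-then-write sequence costs a single BusRd.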

MOESI Protocol

• Introduce a notion of ownership: the Owned state
• Similar to the Shared state
• The O-state processor is responsible for supplying data (the copy in memory may be stale)
• Employed by
  - Sun UltraSparc
  - AMD Opteron
• In dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed

[Figure: dual-core Opteron block diagram with CPU0 and CPU1, each with an L2, the System Request Interface, a crossbar, Hyper-Transport, and the memory controller]

Modified Slide from Prof. H.H. Lee in Georgia Tech

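MOESI Writeback Invalidation Protocol (Processor Request)

Processor-initiated transitions (S: Shared Signal):

Current state   Observed   Transaction    Next state
Invalid         PrRd       BusRd(not-S)   Exclusive
Invalid         PrRd       BusRd(S)       Shared
Invalid         PrWr       BusRdX         Modified
Shared          PrRd       ---            Shared
Shared          PrWr       BusRdX         Modified
Exclusive       PrRd       ---            Exclusive
Exclusive       PrWr       ---            Modified
Owned           PrRd       ---            Owned
Owned           PrWr       BusRdX         Modified
Modified        PrRd       ---            Modified
Modified        PrWr       ---            Modified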

MOESI Writeback Invalidation Protocol (Bus Transactions)

[State diagram, intermediate build: the MESI bus-transaction diagram with an Owned state added. Owned: BusRd / Flush and BusRdX / Flush. The Shared-state actions BusRd / Flush* and BusRdX / Flush* are replaced by BusRd / --- and BusRdX / ---.]

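MOESI Writeback Invalidation Protocol (Bus Transactions)

Bus-snooper-initiated transitions:

Current state   Observed   Action   Next state
Modified        BusRd      Flush    Owned
Modified        BusRdX     Flush    Invalid
Owned           BusRd      Flush    Owned
Owned           BusRdX     Flush    Invalid
Exclusive       BusRd      Flush    Shared
Exclusive       BusRdX     ---      Invalid
Shared          BusRd      ---      Shared
Shared          BusRdX     ---      Invalid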

Transient States in MSI

• Design issues: a coherence transaction is not atomic
  - I → E or S (?), depending on the Shared signal in MESI
  - The next state cannot be determined until the request is launched on the bus and the snoop result is available

BusRdX reads a memory block and invalidates other copies
BusUpgr invalidates potential remote cache copies

Atomic & Non-atomic Buses

• A split-transaction bus increases the available bus bandwidth by breaking up a transaction into subtransactions

[Figure: bus timelines. Atomic bus: the addr, read, and data phases of addr1 complete before those of addr2 begin. Non-atomic bus (pipelined bus or split-transaction bus): the addr, read, and data phases of addr1 through addr4 overlap in time.]

Issues with Pipelined Buses

[Figure: on a non-atomic bus (pipelined bus or split-transaction bus), three reads and then a write, all to addr1, are in flight with overlapping addr, command, and data phases]

• SGI Challenge (mid-1990s) has a system-wide table in each node to book-keep all outstanding requests
  - A request is launched if no entry in the table matches the address of the request

Silicon Graphics, Inc. was an American manufacturer of high-performance computing solutions, including computer hardware and software. (Wikipedia)
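The book-keeping rule for outstanding requests described above can be sketched as a small table. The class name `OutstandingRequestTable`, the default capacity of 8, and the sample addresses are assumptions for illustration; they are not taken from the slide.

```python
class OutstandingRequestTable:
    """Tracks addresses with a request in flight on a split-transaction bus."""

    def __init__(self, capacity=8):   # capacity is an assumed value
        self.capacity = capacity
        self.pending = set()

    def try_launch(self, addr):
        """Launch only if no outstanding request matches the address."""
        if addr in self.pending or len(self.pending) >= self.capacity:
            return False
        self.pending.add(addr)
        return True

    def complete(self, addr):
        """The data phase finished; the entry is freed."""
        self.pending.discard(addr)


table = OutstandingRequestTable()
print(table.try_launch(0x100))   # read addr1: launched
print(table.try_launch(0x200))   # read addr2: launched, overlaps with addr1
print(table.try_launch(0x100))   # write addr1: held back, conflicts with the read
table.complete(0x100)
print(table.try_launch(0x100))   # now the write to addr1 can go
```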

SGI Challenge

Sources:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0620/bks/SGI_Developer/books/REACT_PG/sgi_html/ch02.html
http://www.computinghistory.org.uk/det/11263/SGI-Challenge-10000/
http://en.wikipedia.org/wiki/SGI_Challenge

Inclusion & Exclusion Properties

[Figure: for each property, a CPU core with a register file, L1 I$, and L1 D$, above an L2 and main memory]

Inclusion property
• A block in L1 should be in L2 as well
• L2 block eviction causes the invalidation of the L1 block
• L1 write causes L2 update
• Effective cache size is equal to L2 size
• Desirable for cache coherence

Exclusion property
• A block is located in either L1 or L2
• When an L1 block is replaced, it may be placed in L2
• Better utilization of hardware resources
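The inclusion property can be sketched with a toy two-level hierarchy in which an L2 eviction back-invalidates the L1 copy. The class `InclusiveHierarchy`, the LRU replacement policy, and the tiny cache sizes are assumptions made for this illustration.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Toy L1/L2 pair with LRU replacement; L2 always contains every L1 block."""

    def __init__(self, l1_lines, l2_lines):
        self.l1_lines, self.l2_lines = l1_lines, l2_lines
        self.l1 = OrderedDict()   # block -> True, least recently used first
        self.l2 = OrderedDict()

    def access(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)        # L1 hit: L2 never sees the access
            return
        if block in self.l2:
            self.l2.move_to_end(block)
        else:
            if len(self.l2) >= self.l2_lines:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)     # back-invalidate to keep inclusion
            self.l2[block] = True
        if len(self.l1) >= self.l1_lines:
            self.l1.popitem(last=False)       # the L1 victim still lives in L2
        self.l1[block] = True

    def snoop_must_probe_l1(self, block):
        """With inclusion, L2 filters snoops: a miss in L2 implies a miss in L1."""
        return block in self.l2


h = InclusiveHierarchy(l1_lines=2, l2_lines=3)
for block in ["A", "B", "A", "C", "D"]:
    h.access(block)
print(list(h.l1), list(h.l2))
```

In this trace block A is recently used in L1 but old in L2, so the L2 eviction removes it from L1 as well. Inclusion is preserved, and a snoop for A can be answered from L2 alone.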

Cache Hierarchies

“Achieving Non-Inclusive Cache Performance with Inclusive Caches”, MICRO, 2010

Effective cache sizes
• Inclusive: LLC
• Non-Inclusive: LLC ~ (LLC + L1s)
• Exclusive: LLC + L1s

Coherency in Multi-level Cache Hierarchy

[Figure: two nodes, each with a CPU core, register file, L1 I$, L1 D$, and L2 cache, sharing main memory]

L2 is exclusive
• All incoming bus requests contend with the CPU core for L1

Coherency in Multi-level Cache Hierarchy

[Figure: two nodes, each with a CPU core, register file, L1 I$, L1 D$, and L2 cache, sharing main memory]

L2 is inclusive
• L2 is used as a snoop filter
• L2 line eviction forces the L1 line eviction
• If L1 is a writeback cache, the blocks in L1 and L2 are not consistent
  - A writethrough policy in L1 is desirable
  - Otherwise, L1 should be snooped

Nehalem Case Study

[Figure: Nehalem cache hierarchy. Each CPU core has a register file, an L1 I$ (32KB), an L1 D$ (32KB), and an L2 cache (8-way, 256KB), above main memory. Labels on the cache levels: writeback (three times), Inclusive, Non-Inclusive, 4-cycle.]

In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors. The L1 data cache and the L2/L3 unified caches have two MESI status flags per cache line.

Source: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

Nehalem Uncore


Sandy Bridge

• Transactions from each core travel along the ring
• LLC slices (2MB each) are connected to the ring

Source: http://www.behardware.com/articles/815-3/intel-core-i7-and-core-i5-lga-1155-sandy-bridge.html

TLB and Virtual Memory

[Figure: address translation. Inside the processor, the CPU core issues a virtual (linear) address and the TLB in the MMU translates it to a physical address. Also shown are the virtual memory spaces of Windows XP, MS Word, and Hello world, main memory, and the hard disk.]

MMU: Memory Management Unit

TLB with a Cache

• TLB is a cache for the page table
• Cache is a cache (?) for instructions and data
  - Modern processors typically use the physical address to access caches

[Figure: the CPU core sends a virtual address to the MMU, whose TLB caches entries of the page table held in main memory; the resulting physical address is used to access the cache and main memory, which return instructions or data]

Core i7 Case Study

[Figure: Core i7 cache hierarchy. Each CPU core has a register file, ITLB, DTLB, an L1 I$ (32KB), an L1 D$ (8-way, 32KB), and an L2 cache (8-way, 256KB), above main memory. Labels: Inclusive, Non-Inclusive.]

• L1: VIPT
• L2: PIPT
• L3: PIPT
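A short calculation shows why an 8-way, 32KB L1 D$ is a convenient fit for VIPT (virtually indexed, physically tagged) access. The 64B line size and 4KB page size are assumptions for this sketch, and `vipt_geometry` is a hypothetical helper; all sizes are assumed to be powers of two.

```python
def vipt_geometry(cache_bytes, ways, line_bytes=64, page_bytes=4096):
    """Return (number of sets, offset + index bits, page-offset bits)."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1       # log2 for powers of two
    index_bits = sets.bit_length() - 1
    page_offset_bits = page_bytes.bit_length() - 1
    return sets, offset_bits + index_bits, page_offset_bits


sets, index_and_offset_bits, page_offset_bits = vipt_geometry(32 * 1024, 8)
print(sets, index_and_offset_bits, page_offset_bits)
```

Under these assumptions the set index and line offset use 12 bits, exactly the page-offset bits that are identical in the virtual and physical address. The cache set can therefore be selected in parallel with the TLB lookup, and the physical tag comparison follows once the translation is available.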

TLB Shootdown

• TLB inconsistency arises when a PTE in a TLB is modified
  - The PTE copy in other TLBs and main memory is stale
  - 2 cases
    • Virtual-to-physical mapping change by OS
    • Page access right change by OS
• TLB shootdown procedure (similar to page fault handling)
  - A processor invokes the virtual memory manager, which generates an IPI (Inter-processor Interrupt)
  - Each processor invokes a software handler to remove the stale PTE and invalidate all the block copies in private caches

[Figure: two CPU cores, each with a register file, TLB, and cache, sharing main memory]
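The TLB shootdown procedure above can be sketched with a toy model in which each core caches page-table entries in its own TLB. The names `Core`, `handle_shootdown_ipi`, and `remap_page` are hypothetical, the IPI is modeled as a direct method call, and the invalidation of block copies in private caches is left out.

```python
class Core:
    def __init__(self):
        self.tlb = {}   # virtual page number -> physical frame number

    def translate(self, page_table, vpn):
        if vpn not in self.tlb:                 # TLB miss: walk the page table
            self.tlb[vpn] = page_table[vpn]
        return self.tlb[vpn]

    def handle_shootdown_ipi(self, vpn):
        self.tlb.pop(vpn, None)                 # remove the stale PTE copy


def remap_page(page_table, cores, vpn, new_pfn, shootdown=True):
    """OS changes a virtual-to-physical mapping and optionally shoots down TLBs."""
    page_table[vpn] = new_pfn
    if shootdown:
        for core in cores:
            core.handle_shootdown_ipi(vpn)      # models the IPI and its handler


page_table = {7: 19}
cores = [Core(), Core()]
for core in cores:
    core.translate(page_table, 7)

remap_page(page_table, cores, 7, 13, shootdown=False)
print(cores[1].translate(page_table, 7))   # stale translation from the TLB

remap_page(page_table, cores, 7, 13, shootdown=True)
print(cores[1].translate(page_table, 7))   # refetched from the page table
```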

False Sharing

[Figure: CPU Core 0 and CPU Core 1, each with a register file and a cache, above main memory. Timeline: #1 CPU0 write, #2 CPU1 write, #3 CPU0 write, #3 CPU1 read.]

• Data is loaded into cache on a block granularity (for example, 64B)

• CPUs share a block, but each CPU never uses the data modified by the other CPUs
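Whether two variables can suffer from false sharing depends only on whether their addresses fall in the same cache block. A minimal sketch, assuming 64B blocks as in the example above; the helper names and the sample addresses are hypothetical.

```python
LINE_BYTES = 64   # block size taken from the 64B example


def cache_line(addr: int) -> int:
    """Index of the cache block that contains this byte address."""
    return addr // LINE_BYTES


def falsely_shared(addr_a: int, addr_b: int) -> bool:
    """Different variables that land in the same block."""
    return addr_a != addr_b and cache_line(addr_a) == cache_line(addr_b)


# Two 8-byte counters, one per CPU, laid out back to back.
counter_cpu0, counter_cpu1 = 0x1000, 0x1008
print(falsely_shared(counter_cpu0, counter_cpu1))

# Padding the second counter to the next 64B boundary separates them.
padded_cpu1 = 0x1040
print(falsely_shared(counter_cpu0, padded_cpu1))
```

When the counters share a block, every write by one CPU invalidates the other CPU's copy under an invalidation-based protocol, even though neither CPU uses the other's data. Padding or aligning per-CPU data to the block size is a common way to avoid this.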


Backup Slides


Intel Core 2 Duo

• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O

[Figure: Core 2 Duo block diagram. Callouts: “Classic OOO: Reservation Stations, Issue ports, Schedulers, etc.” and “Large, shared set associative, prefetch, etc.”]

Source: Intel Corp.

Core 2 Duo Microarchitecture


Why Sharing on-die L2?


Intel Quad-Core Processor (Kentsfield, Clovertown)


AMD Barcelona’s Cache Architecture

Source: AMD