Lecture 2. Snoop-based Cache Coherence Protocols
Prof. Taeweon Suh, Computer Science Education
Korea University
COM503 Parallel Computer Architecture & Programming
Korea Univ
Flynn’s Taxonomy
• A classification of computers, proposed by Michael J. Flynn in 1966
• Characterizes computer designs in terms of the number of distinct instructions issued at a time and the number of data elements they operate on
              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
Source: Wikipedia
Flynn’s Taxonomy (Cont.)
• SISD (Single Instruction, Single Data)
   Uniprocessor
   Example: your desktop (notebook) computer before the spread of dual- or more-core CPUs
• SIMD (Single Instruction, Multiple Data)
   Each processor works on its own data stream, but all processors execute the same instruction in lockstep
   Examples: MMX and GPUs
Picture sources: Wikipedia
SIMD Example
• MMX (Multimedia Extension)
   64-bit registers == 2 32-bit integers, 4 16-bit integers, or 8 8-bit integers processed concurrently
• SSE (Streaming SIMD Extensions)
   128-bit registers; the later AVX extension widened these to 256-bit registers == 4 DP floating-point operations
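The MMX idea of packing several small integers into one wide register can be sketched in plain Python as "SIMD within a register" (SWAR). This is an illustrative model with my own naming, not real MMX code: it adds four 16-bit lanes packed into a 64-bit integer while keeping carries from crossing lane boundaries, the way a packed-add instruction would.

```python
# SWAR sketch: add four independent 16-bit lanes packed in a 64-bit word
# (illustrative model of an MMX-style packed add, not actual MMX code).
MASK64 = 0xFFFFFFFFFFFFFFFF
HI = 0x8000800080008000      # top bit of each 16-bit lane
LO = MASK64 ^ HI             # low 15 bits of each lane

def paddw(a, b):
    # Add the low 15 bits of every lane, then patch the top bit of each
    # lane with XOR so no carry ever crosses into the neighboring lane.
    return (((a & LO) + (b & LO)) ^ ((a ^ b) & HI)) & MASK64

# Lanes [1, 2, 3, 4] + [10, 20, 30, 40] -> [11, 22, 33, 44], all at once
print(hex(paddw(0x0001000200030004, 0x000A0014001E0028)))
```

Each lane wraps modulo 2^16 independently, which is exactly the "one instruction, many data elements" behavior the taxonomy describes.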
Flynn’s Taxonomy (Cont.)
• MISD (Multiple Instruction, Single Data)
   Each processor executes different instructions on the same data
   Not used much
• MIMD (Multiple Instruction, Multiple Data)
   Each processor executes its own instructions on its own data
   Virtually all multiprocessor systems are based on MIMD
Picture sources: Wikipedia
Multiprocessor Systems
• Shared memory systems
   Bus-based shared memory
   Distributed shared memory
   Current server systems (for example, Xeon-based servers)
• Cluster-based systems
   Supercomputers and datacenters
Clusters
http://www.tik.ee.ethz.ch/~ddosvax/cluster/
Supercomputer dubbed 7N (Cluster computer), 95th fastest in the world on the TOP500 in 2007
https://www.jlab.org/news/releases/jefferson-lab-boasts-virginias-fastest-computer
Shared Memory Multiprocessor Models
[Figure: three shared-memory organizations.
(1) Bus-based shared memory: processors with private caches share a bus to a single memory.
(2) Fully-connected (dancehall) shared memory: processors with caches reach memories through an interconnection network.
(3) Distributed shared memory: each node pairs a processor and cache with local memory, and nodes communicate over an interconnection network.
Our focus today: bus-based shared memory]
Some Terminologies
• Shared memory systems can be classified into
   UMA (Uniform Memory Access) architecture
   NUMA (Non-Uniform Memory Access) architecture
• SMP (Symmetric Multiprocessor) is a UMA example
   Don't confuse it with SMT (Simultaneous Multithreading)
SMP (UMA) Systems
http://www.evga.com/forums/tm.aspx?m=1897631&mpage=1
Antique (?) P-III based SMP
Sandy Bridge based motherboard
http://news.softpedia.com/newsImage/Gigabyte-Also-Details-Its-Sandy-Bridge-Motherboard-Replacement-Program-2.jpg/
[Figure: two P-III processors, each with its own cache, sharing one memory over a bus]
DSM (NUMA) Machine Examples
• Nehalem-based systems with QPI
http://www.qdpma.com/systemarchitecture/SystemArchitecture_QPI.html
Nehalem-based
Xeon 5500
QPI: QuickPath Interconnect
More Recent NUMA System
http://ark.intel.com/products/64596/Intel-Xeon-Processor-E5-2690-20M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
http://www.intel.in/content/www/in/en/intelligent-systems/crystal-forest-server/xeon-e5-2600-e5-2400-89xx-ibd.html
http://www.anandtech.com/show/6533/gigabyte-ga7pesh1-review-a-dual-processor-motherboard-through-a-scientists-eyes
Amdahl’s Law (Law of Diminishing Returns)
• Amdahl’s law is named after computer architect Gene Amdahl
• It is used to find the maximum expected improvement to an overall system
• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
Maximum speedup = 1 / ((1 − P) + P / N)

• P: parallelizable portion of a program
• N: number of processors
Source: Wikipedia
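The bound is easy to check numerically. The helper below is a minimal sketch (the function name is my own) showing how quickly the serial fraction dominates: with 90% of a program parallelizable, even unlimited processors can never exceed 10x.

```python
def amdahl_speedup(p, n):
    """Maximum speedup with parallelizable fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# 90% parallelizable: 8 processors give well under 8x, and the
# speedup is capped at 1 / (1 - p) = 10x no matter how large n gets.
print(amdahl_speedup(0.9, 8))       # ~4.7
print(amdahl_speedup(0.9, 10**9))   # approaches, but never reaches, 10
```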
WB & WT Caches
[Figure: a CPU core stores X=300. With a writeback cache, only the cached copy becomes 300 while memory still holds X=100 until the line is written back. With a writethrough cache, the store updates both the cache and memory to X=300 immediately]
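The contrast can be modeled in a few lines. This is a toy model with invented names, not any real cache: a writethrough store updates memory on every write, while a writeback store leaves memory stale until the dirty line is evicted.

```python
class ToyCache:
    """Single-level toy cache illustrating writeback vs. writethrough."""
    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}                  # addr -> (value, dirty)

    def write(self, addr, value):
        self.lines[addr] = (value, not self.write_through)
        if self.write_through:
            self.memory[addr] = value    # memory updated on every store

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory[addr] = value    # writeback happens only here

mem_wb, mem_wt = {"X": 100}, {"X": 100}
wb = ToyCache(mem_wb, write_through=False)
wt = ToyCache(mem_wt, write_through=True)
wb.write("X", 300)
wt.write("X", 300)
print(mem_wb["X"], mem_wt["X"])  # memory is stale (100) only under writeback
wb.evict("X")
print(mem_wb["X"])               # eviction writes the dirty line back: 300
```

The stale-memory window under writeback is exactly what makes snooping harder for writeback caches, as the later slides discuss.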
Definition of Coherence
• Coherence is a property of a shared-memory architecture giving the illusion to the software that there is a single copy of every memory location, even if multiple copies exist
• A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
[Figure: two processors with private caches sharing one memory]
Modified slide from Prof. H.H. Lee in Georgia Tech
Definition of Coherence
• A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
• Implicit definition of coherence
   Write propagation: writes are visible to other processes
   Write serialization: all writes to the same location are seen in the same order by all processes
Slide from Prof. H.H. Lee in Georgia Tech
Why Cache Coherency?
• The closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates (writes) lead to an incoherent state
   The problem exhibits in both write-through and writeback caches
Slide from Prof. H.H. Lee in Georgia Tech
[Figure: Core i7: per-core register file, 32KB L1 I$, 32KB L1 D$, and 256KB L2, with a shared 8MB L3]
Writeback Cache w/o Coherence
[Figure: three processors on a shared bus; two caches hold X=100 read from memory; one processor then writes X=505 into its writeback cache, so memory still holds X=100 and reads elsewhere see stale data]
Slide from Prof. H.H. Lee in Georgia Tech
Writethrough Cache w/o Coherence
[Figure: three processors on a shared bus; one processor writes X=505, and the writethrough cache updates memory to X=505, but another cache still holds X=100, so its processor reads the stale value]
Slide from Prof. H.H. Lee in Georgia Tech
Cache Coherence Protocols According to Caching Policies
• Write-through cache
   Update-based protocol
   Invalidation-based protocol
• Writeback cache
   Update-based protocol
   Invalidation-based protocol
Bus Snooping based on Write-Through Cache
• All writes appear as transactions on the shared bus to memory
• Two protocols
   Update-based protocol
   Invalidation-based protocol
Slide from Prof. H.H. Lee in Georgia Tech
Bus Snooping: Update-based Protocol on Write-Through Cache
[Figure: a processor writes X=505; the writethrough generates a bus transaction, memory is updated to 505, and the snooping cache holding X updates its copy to 505 as well]
Slide from Prof. H.H. Lee in Georgia Tech
Bus Snooping: Invalidation-based Protocol on Write-Through Cache
[Figure: a processor writes X=505; the bus transaction updates memory, and the snooping cache invalidates its copy of X; a later Load X then misses and fetches X=505 from memory]
Slide from Prof. H.H. Lee in Georgia Tech
A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache
States: Valid, Invalid (writethrough, no write-allocate)

Processor-initiated transitions:
  Invalid → Valid: PrRd / BusRd
  Invalid → Invalid: PrWr / BusWr (no write-allocate)
  Valid → Valid: PrRd / ---
  Valid → Valid: PrWr / BusWr

Bus-snooper-initiated transitions:
  Valid → Invalid: BusWr / ---

Notation: observed event / bus transaction generated
Slide from Prof. H.H. Lee in Georgia Tech
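The two-state diagram translates directly into a transition function. This is a sketch with my own naming: states are 'V'/'I', processor events are PrRd/PrWr, BusWr is a snooped write from another cache, and no write-allocate means a PrWr in Invalid leaves the line uncached.

```python
def vi_next(state, event):
    """Snoopy protocol for a writethrough, no-write-allocate cache.
    Returns (next_state, bus_transaction_generated)."""
    if event == "BusWr":                 # another cache's write, snooped
        return ("I", None)               # invalidate our copy
    if state == "I":
        if event == "PrRd":
            return ("V", "BusRd")        # read miss fetches the line
        return ("I", "BusWr")            # write goes straight to memory
    # state == "V"
    if event == "PrRd":
        return ("V", None)               # read hit, no bus traffic
    return ("V", "BusWr")                # writethrough: every write on bus

print(vi_next("I", "PrRd"))   # ('V', 'BusRd')
print(vi_next("V", "BusWr"))  # ('I', None)
```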
How about Writeback Cache?
• WB cache to reduce bandwidth requirement
• The majority of local writes are hidden behind the processor nodes
• How to snoop?
• Write Ordering
Slide from Prof. H.H. Lee in Georgia Tech
Cache Coherence Protocols for WB Caches
• A cache has an exclusive copy of a line if
   It is the only cache holding a valid copy
   Memory may or may not have an up-to-date copy
• Modified (dirty) cache line
   The cache holding the line is the owner of the line, because it must supply the block
Slide from Prof. H.H. Lee in Georgia Tech
Update-based Protocol on WB Cache
[Figure: a processor stores X=505; the update is broadcast on the bus, and both memory and the sharing caches change their copies from 100 to 505]

• Update the data in all processor nodes that share the same data
• Because a processor node keeps updating the memory location, a lot of traffic will be incurred

Slide from Prof. H.H. Lee in Georgia Tech
Update-based Protocol on WB Cache
[Figure: another processor loads X=505 and hits in its updated cache; a store of X=333 is then broadcast again, updating memory and all sharers]

Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache
• Invalidate the data copies in the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

[Figure: a processor stores X=505; an invalidation goes out on the bus, and the sharing caches invalidate their copies of X; only the writer's cache now holds X=505, and memory is stale]

Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache
[Figure: another processor loads X and misses in its invalidated cache; the snoop hits in the writer's cache, which supplies X=505]

Slide from Prof. H.H. Lee in Georgia Tech
Invalidation-based Protocol on WB Cache
[Figure: one processor keeps storing to X (505, then 333, 987, 444); after the first invalidation, the later stores hit in its Modified line and generate no further bus traffic]

Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol
• Modified: dirty; only this cache has a valid copy
• Shared: memory is consistent; one or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before memory is updated
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol
• Two types of requests from the processor: PrRd and PrWr
• Three types of bus transactions posted by the cache controller
   BusRd: a PrRd misses the cache; memory or another cache supplies the line
   BusRdX (read-to-own): a PrWr is issued to a line which is not in the Modified state
   BusWB: a writeback due to replacement; the processor is not directly involved in initiating this operation
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Processor Requests)
Processor-initiated transitions:
  Invalid → Shared: PrRd / BusRd
  Invalid → Modified: PrWr / BusRdX
  Shared → Shared: PrRd / ---
  Shared → Modified: PrWr / BusRdX
  Modified → Modified: PrRd / ---, PrWr / ---
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Bus Transactions)
• On a Flush, the data is put on the bus; both memory and the requestor grab a copy
• The requestor gets the data from either a cache-to-cache transfer or memory

Bus-snooper-initiated transitions:
  Modified → Shared: BusRd / Flush
  Modified → Invalid: BusRdX / Flush
  Shared → Shared: BusRd / ---
  Shared → Invalid: BusRdX / ---
Slide from Prof. H.H. Lee in Georgia Tech
MSI Writeback Invalidation Protocol (Bus Transactions): Another Possible Implementation
• Alternative: on a snooped BusRd, a Modified line flushes and goes directly to Invalid (Modified → Invalid: BusRd / Flush) instead of to Shared
• Anticipates no more reads from this processor
• A performance concern: it saves the later "invalidation" trip if the requesting cache writes the shared line afterwards
Slide from Prof. H.H. Lee in Georgia Tech
Korea Univ
MSI Writeback Invalidation Protocol
(Combined diagram: the processor-initiated and bus-snooper-initiated MSI transitions shown together)

Slide from Prof. H.H. Lee in Georgia Tech
MSI Example

[Figure: three processors P1, P2, P3, each with a private cache, on a shared bus to memory]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          |
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X       | S           | S           | S           | BusRd           | Memory

Slide from Prof. H.H. Lee in Georgia Tech
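The example trace can be replayed with a toy simulator (class and method names are my own invention, not from the slides): each cache keeps one MSI state for line X, a snooped BusRd demotes a Modified copy to Shared and makes that cache the data supplier, and a snooped BusRdX invalidates all other copies.

```python
class MSIBus:
    """Toy snoopy bus: one cache line, one MSI state per processor."""
    def __init__(self, n):
        self.state = ["I"] * n

    def read(self, i):                       # PrRd by processor i
        if self.state[i] != "I":
            return None, None                # cache hit, no bus transaction
        supplier = "Memory"
        for j, s in enumerate(self.state):
            if j != i and s == "M":
                self.state[j] = "S"          # snooped BusRd: M -> S, Flush
                supplier = "P%d Cache" % (j + 1)
        self.state[i] = "S"
        return "BusRd", supplier

    def write(self, i):                      # PrWr by processor i
        txn = None if self.state[i] == "M" else "BusRdX"
        for j in range(len(self.state)):
            if j != i:
                self.state[j] = "I"          # snooped BusRdX: invalidate
        self.state[i] = "M"
        return txn, None

bus = MSIBus(3)
for op, i in [("read", 0), ("read", 2), ("write", 2), ("read", 0), ("read", 1)]:
    txn, supplier = getattr(bus, op)(i)
    print("P%d %s" % (i + 1, op), txn, supplier, bus.state)
```

Running it reproduces the table row by row, including the cache-to-cache supply on P1's re-read after P3's write.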
MESI Writeback Invalidation Protocol
• To reduce two types of unnecessary bus transactions
   A BusRdX that snoops and converts the block from S to M when you are the sole owner of the block
   A BusRd that gets the line in the S state when there are no sharers (which leads to the overhead above)
• Introduce the Exclusive state: one can write to the copy without generating a BusRdX
• Illinois protocol: proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, and MIPS processors

Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation (Processor Request)
Processor-initiated transitions (S: shared signal):
  Invalid → Exclusive: PrRd / BusRd (not-S)
  Invalid → Shared: PrRd / BusRd (S)
  Invalid → Modified: PrWr / BusRdX
  Exclusive → Exclusive: PrRd / ---
  Exclusive → Modified: PrWr / ---
  Shared → Shared: PrRd / ---
  Shared → Modified: PrWr / BusRdX
  Modified → Modified: PrRd, PrWr / ---
Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation Protocol (Bus Transactions)
Bus-snooper-initiated transitions (Flush*: flush by the data supplier; no action for the other sharers):
  Modified → Shared: BusRd / Flush
  Modified → Invalid: BusRdX / Flush
  Exclusive → Shared: BusRd / Flush (or ---)
  Exclusive → Invalid: BusRdX / ---
  Shared → Shared: BusRd / Flush*
  Shared → Invalid: BusRdX / Flush*

• Whenever possible, the Illinois protocol performs a cache-to-cache ($-to-$) transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple suppliers (alternative: add an O state or force a memory update)
Modified Slide from Prof. H.H. Lee in Georgia Tech
MESI Writeback Invalidation Protocol (Illinois Protocol)
(Combined diagram: the processor-initiated and bus-snooper-initiated MESI transitions shown together)

Slide from Prof. H.H. Lee in Georgia Tech
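The role of the shared (S) signal on a read miss can be sketched as follows (a toy model with my own naming): the requester samples whether any other cache asserts "shared" and enters E or S accordingly, and a later write in E is then a silent E → M upgrade with no BusRdX, which is the payoff MESI adds over MSI.

```python
def mesi_read(states, i):
    """PrRd miss by cache i: sample the shared signal to pick E or S."""
    shared = any(s != "I" for j, s in enumerate(states) if j != i)
    for j, s in enumerate(states):
        if j != i and s in ("E", "M"):
            states[j] = "S"              # sole owner drops to S (with Flush)
    states[i] = "S" if shared else "E"
    return "BusRd(S)" if shared else "BusRd(not-S)"

def mesi_write(states, i):
    """PrWr by cache i; E -> M needs no bus transaction."""
    if states[i] in ("E", "M"):
        states[i] = "M"
        return None                      # silent upgrade: the MESI payoff
    for j in range(len(states)):
        if j != i:
            states[j] = "I"              # BusRdX invalidates other copies
    states[i] = "M"
    return "BusRdX"

st = ["I", "I"]
print(mesi_read(st, 0), st)   # sole reader enters Exclusive
print(mesi_write(st, 0), st)  # E -> M silently, no BusRdX on the bus
```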
MOESI Protocol
• Introduce a notion of ownership ─ the Owned state
   Similar to the Shared state, but the O-state processor is responsible for supplying the data (the copy in memory may be stale)
• Employed by Sun UltraSPARC and AMD Opteron
• In the dual-core Opteron, cache-to-cache transfer is done through a System Request Interface (SRI) running at full CPU speed

[Figure: dual-core Opteron: CPU0/L2 and CPU1/L2 connect to a System Request Interface, which feeds a crossbar linking HyperTransport and the memory controller]
Modified Slide from Prof. H.H. Lee in Georgia Tech
MOESI Writeback Invalidation Protocol (Processor Requests)
Processor-initiated transitions (S: shared signal):
  Invalid → Exclusive: PrRd / BusRd (not-S)
  Invalid → Shared: PrRd / BusRd (S)
  Invalid → Modified: PrWr / BusRdX
  Exclusive → Exclusive: PrRd / ---
  Exclusive → Modified: PrWr / ---
  Shared → Shared: PrRd / ---
  Shared → Modified: PrWr / BusRdX
  Owned → Owned: PrRd / ---
  Owned → Modified: PrWr / BusRdX
  Modified → Modified: PrRd, PrWr / ---
MOESI Writeback Invalidation Protocol (Bus Transactions)
Bus-snooper-initiated transitions (Flush*: flush by the data supplier; no action for the other sharers):
  Modified → Owned: BusRd / Flush
  Modified → Invalid: BusRdX / Flush
  Owned → Owned: BusRd / Flush (the owner keeps supplying the data)
  Owned → Invalid: BusRdX / Flush
  Exclusive → Shared: BusRd / Flush (or ---)
  Exclusive → Invalid: BusRdX / ---
  Shared → Shared: BusRd / --- (or Flush*)
  Shared → Invalid: BusRdX / --- (or Flush*)
MOESI Writeback Invalidation Protocol (Bus Transactions): variant

(Diagram: a variant drawing of the bus-snooper-initiated MOESI transitions)
Transient States in MSI
• Design issue: a coherence transaction is not atomic
• I → E or S (?) depending on the Shared signal in MESI: the next state cannot be determined until the request is launched on the bus and the snoop result is available
• BusRdX reads a memory block and invalidates other copies; BusUpgr invalidates potential remote cache copies
Atomic & Non-atomic Buses
• A split-transaction bus increases the available bus bandwidth by breaking a transaction into subtransactions

[Timeline figure: on an atomic bus, each transaction occupies the bus from its address phase through its data phase before the next one starts; on a non-atomic (pipelined or split-transaction) bus, the address phases of later transactions overlap the data phases of earlier ones]
Issues with Pipelined Buses
[Figure: on a pipelined bus, several outstanding transactions to the same address (three reads and a write to addr1) can be in flight at once]

• The SGI Challenge (mid-1990s) has a system-wide table in each node to book-keep all outstanding requests: a request is launched only if no entry in the table matches the address of the request

Silicon Graphics, Inc. was an American manufacturer of high-performance computing solutions, including computer hardware and software. (Wikipedia)
SGI Challenge
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0620/bks/SGI_Developer/books/REACT_PG/sgi_html/ch02.html
http://www.computinghistory.org.uk/det/11263/SGI-Challenge-10000/
http://en.wikipedia.org/wiki/SGI_Challenge
Inclusion & Exclusion Properties
[Figure: two hierarchies, each a CPU core with register file and L1 I$/D$ over an L2, backed by main memory]

Inclusion property
• A block in L1 must be in L2 as well
• An L2 block eviction causes the invalidation of the L1 block
• An L1 write causes an L2 update
• The effective cache size equals the L2 size
• Desirable for cache coherence

Exclusion property
• A block is located in either L1 or L2, but not both
• When an L1 block is replaced, it can be placed in L2
• Better utilization of hardware resources
Cache Hierarchies
“Achieving Non-Inclusive Cache Performance with Inclusive Caches”, MICRO, 2010

Effective cache sizes: Inclusive = LLC; Non-Inclusive = between LLC and (LLC + L1s); Exclusive = LLC + L1s
Coherency in Multi-level Cache Hierarchy
[Figure: two nodes, each with a CPU core, L1 I$/D$, and L2 cache, sharing main memory]

L2 is exclusive
• All incoming bus requests contend with the CPU core for L1
Coherency in Multi-level Cache Hierarchy
[Figure: two nodes, each with a CPU core, L1 I$/D$, and L2 cache, sharing main memory]

L2 is inclusive
• L2 is used as a snoop filter
• An L2 line eviction forces the corresponding L1 line eviction
• If L1 is a writeback cache, the blocks in L1 and L2 are not consistent
   A writethrough policy in L1 is desirable
   Otherwise, L1 must also be snooped
Nehalem Case Study
[Figure: Nehalem hierarchy: per-core 32KB L1 I$/D$ (4-cycle), 8-way 256KB L2 (writeback, non-inclusive), shared 8MB L3 (writeback, inclusive), backed by main memory]
In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors. The L1 data cache and the L2/L3 unified caches have two MESI status flags per cache line.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
Nehalem Uncore
(Figure only)
Sandy Bridge
http://www.behardware.com/articles/815-3/intel-core-i7-and-core-i5-lga-1155-sandy-bridge.html
• Transactions from each core travel along the ring
• LLC slices (2MB each) are connected to the ring
TLB and Virtual Memory
[Figure: the CPU core issues a virtual (linear) address; the TLB in the MMU translates it to a physical address used to access main memory; each process's virtual memory space (e.g., Windows XP, MS Word, a "Hello world" program) maps pages to physical frames or to the hard disk]

MMU: Memory Management Unit
TLB with a Cache
• The TLB is a cache for the page table
• The cache is a cache (?) for instructions and data
• Modern processors typically use the physical address to access caches

[Figure: the CPU core sends a virtual address to the MMU/TLB, which is backed by the page table in main memory; the resulting physical address indexes the cache, which returns instructions or data]
Core i7 Case Study
[Figure: Core i7 hierarchy: per-core ITLB/DTLB, 32KB L1 I$ and 8-way 32KB L1 D$ (VIPT), 8-way 256KB L2 (PIPT, non-inclusive), shared 8MB L3 (PIPT, inclusive), backed by main memory]

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
TLB Shootdown
• TLB inconsistency arises when a PTE cached in a TLB is modified: the PTE copies in other TLBs and in main memory become stale
• Two cases
   A virtual-to-physical mapping change by the OS
   A page access-right change by the OS
• TLB shootdown procedure (similar to page-fault handling)
   A processor invokes the virtual memory manager, which generates an IPI (inter-processor interrupt)
   Each processor invokes a software handler to remove the stale PTE and invalidate all the block copies in its private caches

[Figure: two nodes, each with a CPU core, TLB, and cache, sharing main memory]
False Sharing
[Figure: CPU core 0 and CPU core 1, each with a private cache, share main memory. Timeline: #1 CPU0 write, #2 CPU1 write, #3 CPU0 write, #4 CPU1 read; the shared block ping-pongs between the two caches]

• Data is loaded into the cache at block granularity (for example, 64B)
• The CPUs share a block, but each CPU never uses the data modified by the other CPU
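The ping-pong can be reproduced with a toy coherence model (names are invented; real false sharing is observed through performance counters rather than simulated): two variables that happen to share one 64-byte line make CPU0 miss on x even though only y was ever modified.

```python
LINE = 64  # bytes per cache block

class PrivateCache:
    def __init__(self):
        self.valid = set()               # set of cached line numbers

def access(caches, cpu, addr, write=False):
    """Returns True on a miss; a write invalidates the line elsewhere."""
    line = addr // LINE
    miss = line not in caches[cpu].valid
    caches[cpu].valid.add(line)
    if write:
        for other, c in enumerate(caches):
            if other != cpu:
                c.valid.discard(line)    # snooped invalidation
    return miss

caches = [PrivateCache(), PrivateCache()]
x, y = 0, 8                              # different variables, same line
print(access(caches, 0, x))              # True: cold miss on x
print(access(caches, 1, y, write=True))  # True: CPU1's cold miss on y
print(access(caches, 0, x))              # True: false-sharing miss on x
```

Padding the variables onto separate lines (e.g., placing y at offset 64) makes the third access a hit, which is the standard fix for false sharing.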
Backup Slides
Intel Core 2 Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O
• Classic OOO cores: reservation stations, issue ports, schedulers, etc.
• L2: large, shared, set-associative, with prefetching, etc.

Source: Intel Corp.
Core 2 Duo Microarchitecture
Why Sharing on-die L2?
Intel Quad-Core Processor (Kentsfield, Clovertown)
AMD Barcelona’s Cache Architecture
Source: AMD