Symmetric Multiprocessor (SMP)
ECE 7660
© C. Xu 1999-2005
Shared Memory Multiprocessors
Symmetric Multiprocessors (SMPs)
o Symmetric access to all of main memory from any processor
Dominate the server market
o Building blocks for larger systems; arriving at the desktop
Attractive as servers and for parallel programs
o Fine-grain resource sharing
o Uniform access via loads/stores
o Automatic data movement and coherent replication in caches
o Useful for the operating system too
Normal uniprocessor mechanisms to access data (reads and writes)
o Key is extending the memory hierarchy to support multiple processors
Supporting Programming Models
o Address translation and protection in hardware (hardware SAS)
o Message passing using shared memory buffers
– can be very high performance since no OS involvement necessary
o Focus here on supporting coherent shared address space
[Figure: layers of abstraction. Programming models (multiprogramming, shared address space, message passing) sit above the communication abstraction at the user/system boundary; compilation or library and operating systems support sit above the hardware/software boundary; communication hardware sits above the physical communication medium]
Natural Extensions of Memory System
[Figure: four organizations. (a) Shared cache: processors connect through a switch to a shared first-level cache and interleaved main memory. (b) Bus-based shared memory: private caches on a shared bus with memory and I/O devices. (c) Dancehall: private caches and memories on opposite sides of an interconnection network. (d) Distributed-memory: memory attached to each processor node, nodes connected by an interconnection network]
Caches and Cache Coherence
Caches play a key role in all cases
o Reduce average data access time
o Reduce bandwidth demands placed on the shared interconnect
But private processor caches create a problem
o Copies of a variable can be present in multiple caches
o A write by one processor may not become visible to others
  – They'll keep accessing the stale value in their caches
o This is the cache coherence problem
o Need to take actions to ensure visibility
Shared Cache
Advantages
o Sharing of working sets
o Low-latency sharing and prefetching across processors
o No coherence problem
Disadvantages
o High bandwidth needs and negative interference (e.g. conflicts)
o Latency increased due to the intervening switch and cache size
o Mid 80s: a couple of processors on a board (Encore, Sequent)
o Today: multiprocessors on a chip
A Coherent Memory System
Intuitive behavior model of a memory system:
Reading a location should return the latest value
written to that location
We rely on this model to reason about the correctness of a sequential program
Example of Cache Coherence Problem
o Processors see different values for u after event 3
o With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
  – Processes accessing main memory may see a very stale value
o Unacceptable to programs, and frequent!
[Figure: P1, P2, and P3 with private caches on a bus to memory and I/O devices; memory initially holds u:5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u=7, (4) P1 reads u, (5) P2 reads u]
Cache Coherence in Uniprocessors for I/O
The intuitive behavior model becomes invalid in uniprocessors for I/O
o Coherence between I/O devices and processors
o But it is infrequent, so software solutions work
  – uncacheable memory, uncacheable operations, flush pages, pass I/O data through caches
But the coherence problem is much more critical in multiprocessors
o Pervasive
o Performance-critical
o Must be treated as a basic hardware design issue
Problems with the Intuition
Recall: the value returned by a read should be the latest value written. But "latest" is not well-defined
In the sequential case, it is defined in terms of program order, not time
o Order of operations in the machine language presented to the processor
o "Subsequent" is defined in an analogous way, and is well defined
In the parallel case, program order is defined within a process, but we need to make sense of orders across processes
Must define a meaningful semantics
Extension to Multiprocessors
Memory operation
o A single read, write, or read-modify-write to a memory location, assumed to execute atomically w.r.t. each other
Issue event: occurs when an operation leaves a processor's internal environment and is presented to the memory system (cache, buffer, ...)
Perform event: the operation appears to have been performed, as far as a processor can tell from the other memory operations it issues
o A write appears to have been performed when a subsequent read by any processor returns the value of that write or a later write
o A read appears to have been performed when subsequent writes issued by any processor cannot affect the value returned by the read
Complete: performed with respect to all processors
Sharpening the Intuition
– Consider a single shared memory and no caches: memory imposes a serial or total order on all operations to the same physical location
o Operations to the location from a given processor are in program order
o The order of operations to the location from different processors is some interleaving that preserves the individual program orders
– "Latest" now means most recent in a hypothetical serial order that maintains these properties
– For the serial order to be consistent, all processors must "see" writes to the location in the same order
– And the program should behave as if some serial order is enforced
o The order in which things appear to happen, not actually happen
Formal Definition of Coherence
A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:
o operations issued by a process occur in the order in which they were issued to the memory system by that process, and
o the value returned by a read is the value written by the last write to that location in the serial order
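For tiny executions at a single location, this definition can be checked directly by brute force. The sketch below (hypothetical helper name `coherent`; exponential search, illustration only) tries every interleaving that respects each process's issue order and accepts if some interleaving makes every read return the last write before it:

```python
# Brute-force check of the coherence definition for one location (sketch).
from itertools import permutations

def coherent(histories):
    """histories: per-process lists of ('w', value_written) or ('r', value_returned)."""
    ops = [(p, i) for p, h in enumerate(histories) for i in range(len(h))]
    for order in permutations(ops):
        # Program order: each process's operations must appear in issue order.
        if any(order.index((p, i)) > order.index((p, i + 1))
               for p, h in enumerate(histories) for i in range(len(h) - 1)):
            continue
        last, ok = None, True          # value of the last write so far
        for p, i in order:
            kind, val = histories[p][i]
            if kind == 'w':
                last = val
            elif val != last:          # a read must return the last write
                ok = False
                break
        if ok:
            return True                # found a valid hypothetical serial order
    return False

# P0 writes 1 then 2; P1 reads 2 then 1: no serial order exists -> incoherent
print(coherent([[('w', 1), ('w', 2)], [('r', 2), ('r', 1)]]))   # -> False
print(coherent([[('w', 1), ('w', 2)], [('r', 1), ('r', 2)]]))   # -> True
```

The first history is incoherent because P1's second read would require a write of 1 after the write of 2, violating P0's program order.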
Two Necessary Features:
Write propagation:
o A value written must become visible to others
Write serialization:
o Writes to a location are seen in the same order by all
  – if one processor sees w1 before w2, another processor must not see w2 before w1
  – no need for analogous read serialization, since reads are not visible to others
Cache Coherence Using a Bus
Built on top of two fundamentals of uniprocessor systems
o Bus transactions
o State transition diagram in the cache
Uniprocessor bus transaction:
o Three phases: arbitration, command/address, data transfer
o All devices observe addresses; one is responsible
Uniprocessor cache states:
o Effectively, every block is a finite state machine
o Write-through, write-no-allocate has two states: valid, invalid
o Writeback caches have one more state: modified ("dirty")
Multiprocessors extend both of these somewhat to implement coherence
Snooping-based Coherence
Basic Idea
o Transactions on the bus are visible to all processors
o Processors or their representatives (cache controllers) can snoop the bus and take action on relevant events (e.g. change state)
Implementing a Protocol
o The cache controller now receives inputs from both sides:
  – Requests from the processor, bus requests/responses from the snooper
o In either case, it takes zero or more actions
  – Updates state, responds with data, generates new bus transactions
o The protocol is a distributed algorithm: cooperating state machines
  – Set of states, state transition diagram, actions
o Granularity of coherence is typically a cache block
Coherence with Write-through Caches
Key extensions to uniprocessor: snooping, invalidating/updating caches
o no new states or bus transactions in this case
o invalidation- versus update-based protocols
Write propagation: even in inval case, later reads will see new value
o inval causes miss on later access, and memory up-to-date via write-through
[Figure: P1 ... Pn with private caches on a bus to memory and I/O devices; each cache controller snoops the bus, observing other caches' cache-memory transactions]
Write-through State Transition Diagram
o A write invalidates all other caches (no local change of state)
  – can have multiple simultaneous readers of a block, but a write invalidates them

[State diagram: two states, V (Valid) and I (Invalid)
 Processor-initiated transitions: I -(PrRd/BusRd)-> V; V -(PrRd/—)-> V; V -(PrWr/BusWr)-> V; I -(PrWr/BusWr)-> I
 Bus-snooper-initiated transition: V -(BusWr/—)-> I]
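The two-state protocol above can be sketched as a toy single-block simulator (hypothetical class and method names; write-through, write-no-allocate, invalidation-based):

```python
# Toy snooping simulator for the write-through invalidation protocol:
# each cache holds one block in state V (valid) or I (invalid).

class WTCache:
    def __init__(self):
        self.state, self.data = "I", None

    def pr_rd(self, bus):
        if self.state == "I":          # read miss: BusRd fetches from memory
            self.data = bus.bus_rd()
            self.state = "V"
        return self.data

    def pr_wr(self, bus, value):
        bus.bus_wr(self, value)        # every write goes to the bus (write-through)
        if self.state == "V":          # write-no-allocate: stay I on a write miss
            self.data = value

class Bus:
    def __init__(self):
        self.memory, self.caches = 0, []

    def bus_rd(self):
        return self.memory

    def bus_wr(self, writer, value):
        self.memory = value            # memory updated on every write
        for c in self.caches:          # snoopers invalidate their copies
            if c is not writer:
                c.state = "I"

bus = Bus()
p1, p2 = WTCache(), WTCache()
bus.caches = [p1, p2]

p1.pr_rd(bus)            # P1: I -> V
p2.pr_rd(bus)            # P2: I -> V
p1.pr_wr(bus, 7)         # P1 writes 7: BusWr invalidates P2's copy
print(p2.state)          # -> I (stale copy invalidated)
print(p2.pr_rd(bus))     # -> 7 (P2 re-reads: miss, gets new value from memory)
```

Write propagation holds because memory is updated on every write and the invalidation forces the next read to miss.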
Is it Coherent?
To show: for any execution under the protocol, a total order of memory operations can be constructed that satisfies program order and write serialization.
Assume both bus transactions and memory operations are atomic
o With a one-level cache, assume invalidations are applied during the bus transaction
o These assumptions can be relaxed in more complex systems
Under the protocol, all writes go to the bus
o Writes are serialized by the order in which they appear on the bus (bus order)
o Per the above assumptions, invalidations are applied to caches in bus order
How to insert reads in this order?
o Processors see writes through reads
o But reads to a location are not completely serialized, due to read hits
Is it Coherent? (cont'd): Ordering Reads
– Read misses: appear on the bus, and will see the last write in bus order
– Read hits: do not appear on the bus
o But the value read was placed in the cache by either
  – the most recent write by this processor, or
  – the most recent read miss by this processor
o Both of these transactions appear on the bus
o So read hits also see values as being produced in consistent bus order
Problem with Write-Through
High bandwidth requirements
o Every write from every processor goes to the shared bus and memory
o Consider a 200MHz, 1 CPI processor, where 15% of instructions are 8-byte stores
o Each processor generates 30M stores, or 240MB of data, per second
o How many processors can a 1GB/s bus support without saturating?
  – bytes to the bus per second = 200M x 0.15 x 8 = 240MB
  – processors supported by the bus = 1GB / 240MB ≈ 4
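The arithmetic above, as a quick check (integer arithmetic, per-processor figures from the slide):

```python
# Bandwidth demand of write-through caches.
clock = 200_000_000                      # 200 MHz at 1 CPI -> 200M instructions/s
stores_per_sec = clock * 15 // 100       # 15% of instructions are stores
bytes_per_sec = stores_per_sec * 8       # 8 bytes per store: per-processor traffic

print(stores_per_sec)                    # -> 30000000  (30M stores/s)
print(bytes_per_sec)                     # -> 240000000 (240 MB/s)

bus_bw = 1_000_000_000                   # 1 GB/s shared bus
print(bus_bw // bytes_per_sec)           # -> 4 processors before saturation
```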
Write-back caches absorb most writes as cache hits
o Write hits don't go on the bus
o But now how do we ensure write propagation and serialization?
o Need more sophisticated protocols: large design space
Snooping Protocols for Write-Back Cache
No need to change the processor, main memory, or cache
o Extend the cache controller and exploit the bus (provides serialization)
Dirty state now also indicates exclusive ownership
o Exclusive: the only cache with a valid copy (main memory may have one too)
o Owner: responsible for supplying the block upon a request for it
Design space
o Invalidation- versus update-based protocols
o Set of states
Invalidation-based Protocols
Exclusive means the block can be modified without notifying anyone else
o i.e. without a bus transaction
o Must first get the block in exclusive state before writing into it
o Even if already in valid state, a transaction is needed, so this is called a write miss
A store to non-dirty data generates a read-exclusive bus transaction
o Tells others about the impending write and obtains exclusive ownership
  – makes the write visible, i.e. the write is performed
  – it may actually be observed (by a read miss) only later
  – a write hit is made visible (performed) when the block is updated in the writer's cache
o Only one RdX can succeed at a time for a block: serialized by the bus
Read and read-exclusive bus transactions drive coherence actions
o Writeback transactions do too, but they are not caused by a memory operation and are quite incidental to the coherence protocol
  – note: a replaced block that is not in modified state can be dropped
Basic MSI Writeback Inval Protocol
States
o Invalid (I)
o Shared (S): one or more copies
o Dirty or Modified (M): one copy only
Processor events
o PrRd (read)
o PrWr (write)
Bus transactions
o BusRd: asks for a copy with no intent to modify
o BusRdX: asks for a copy with intent to modify
o BusWB: updates memory
Actions
o Update state, perform bus transaction, flush value onto bus
State Transition Diagram
o Write to a shared block (a write miss)
  – Already have the latest data; can use an upgrade (BusUpgr) instead of BusRdX to reduce data traffic in the bus protocol
  – BusUpgr obtains exclusive ownership like a BusRdX, but it doesn't cause memory to respond with the data
o Replacement changes the state of two blocks:
  – The outgoing block changes its state to Invalid
  – The incoming block changes its state from Invalid to a new state
[State diagram (MSI):
 I -(PrRd/BusRd)-> S; I -(PrWr/BusRdX)-> M
 S -(PrRd/—)-> S; S -(PrWr/BusRdX)-> M; S -(BusRd/—)-> S; S -(BusRdX/—)-> I
 M -(PrRd/—)-> M; M -(PrWr/—)-> M; M -(BusRd/Flush)-> S; M -(BusRdX/Flush)-> I]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u         S    --   --   BusRd        Memory
P3 reads u         S    --   S    BusRd        Memory
P3 writes u        I    --   M    BusRdX       P3
P1 reads u         S    --   S    BusRd        P3 cache
P2 reads u         S    S    S    BusRd        Memory *

* "P1 reads u" flushes the P3 cache and updates the memory copy
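As a sketch, the trace above can be replayed on a toy single-block MSI simulator (hypothetical class names; in this simplified version the BusRdX data comes from memory, since no cache holds the block modified at that point):

```python
# Toy MSI snooping simulator for a single block.

class MSICache:
    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, "I"

    def read(self):
        if self.state == "I":              # read miss: BusRd, enter Shared
            self.bus.transact(self, "BusRd")
            self.state = "S"

    def write(self):
        if self.state != "M":              # need exclusive ownership first
            self.bus.transact(self, "BusRdX")
            self.state = "M"

    def snoop(self, xact):
        """Snooper side: react to another cache's bus transaction."""
        if self.state == "M":              # owner flushes the dirty block
            self.state = "S" if xact == "BusRd" else "I"
            return self.name + " cache"    # cache-to-cache supply; memory updated too
        if xact == "BusRdX":
            self.state = "I"               # invalidate a shared copy
        return None

class MSIBus:
    def __init__(self):
        self.caches, self.log = [], []

    def transact(self, requester, xact):
        supplier = "Memory"
        for c in self.caches:
            if c is not requester:
                supplier = c.snoop(xact) or supplier
        self.log.append((requester.name, xact, supplier))

bus = MSIBus()
p1, p2, p3 = (MSICache(n, bus) for n in ("P1", "P2", "P3"))
bus.caches = [p1, p2, p3]

for op in (p1.read, p3.read, p3.write, p1.read, p2.read):
    op()
for entry in bus.log:
    print(entry)   # bus-action and supplier columns of the trace
```

The log reproduces the Bus Action column, and the final states of P1, P2, P3 are all S, as in the last row of the table.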
Satisfying Coherence
Write propagation is clear: via the bus read-exclusive transaction
Write serialization?
o All writes that appear on the bus (BusRdX) are ordered by the bus
  – A write is performed in the writer's cache before it handles other transactions, so it is ordered the same way even w.r.t. the writer
o Reads that appear on the bus are ordered w.r.t. these. What about writes that don't appear on the bus?
  – a sequence of such writes between two bus transactions for the block must come from the same processor, say P
  – in the serialization, the sequence appears between these two bus transactions
  – reads by P will see them in this order w.r.t. other bus transactions
  – reads by other processors are separated from the sequence by a bus transaction, which places them in the serialized order w.r.t. the writes
  – so reads by all processors see the writes in the same order
MESI (4-state) Invalidation Protocol
Problem with the MSI protocol
o Reading and then modifying data takes 2 bus transactions, even if no one is sharing
  – e.g. even in a sequential program
  – BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add an exclusive state: write locally without a transaction, but block not modified
o Main memory is up to date, so the cache is not necessarily the owner
o States
  – invalid
  – exclusive or exclusive-clean (only this cache has a copy, but not modified)
  – shared (two or more caches may have copies)
  – modified (dirty)
o I -> E on PrRd if no one else has a copy
  – needs a "shared" signal on the bus: a wired-OR line asserted in response to BusRd
MESI State Transition Diagram
o An additional control signal S determines, on a BusRd, whether any other cache currently holds the block
  – BusRd(S): shared line S asserted on the BusRd transaction
  – BusRd(S̄): shared line S not asserted
  – Plain BusRd: don't care about S
o Flush′: if cache-to-cache sharing is used, only one cache flushes the data
  – C2C sharing adds complexity: memory must wait to learn that no cache will supply the data
[State diagram (MESI):
 I -(PrRd/BusRd(S̄))-> E; I -(PrRd/BusRd(S))-> S; I -(PrWr/BusRdX)-> M
 E -(PrRd/—)-> E; E -(PrWr/—)-> M; E -(BusRd/Flush)-> S; E -(BusRdX/Flush)-> I
 S -(PrRd/—)-> S; S -(PrWr/BusRdX)-> M; S -(BusRd/Flush′)-> S; S -(BusRdX/Flush′)-> I
 M -(PrRd/—)-> M; M -(PrWr/—)-> M; M -(BusRd/Flush)-> S; M -(BusRdX/Flush)-> I]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u
P3 reads u
P3 writes u
P1 reads u
P2 reads u
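One way to work through this trace is a toy single-block MESI simulator with an Exclusive state and a wired-OR shared line (hypothetical class names; cache-to-cache supply and Flush′ details omitted):

```python
# Toy MESI simulator for a single block: MSI plus an Exclusive state,
# using a wired-OR "shared" line sampled on BusRd.

class MESICache:
    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, "I"

    def read(self):
        if self.state == "I":
            shared = self.bus.transact(self, "BusRd")
            self.state = "S" if shared else "E"   # E if no other cache has a copy

    def write(self):
        if self.state in ("E", "M"):
            self.state = "M"                      # silent upgrade: no bus transaction
        else:
            self.bus.transact(self, "BusRdX")
            self.state = "M"

    def snoop(self, xact):
        had_copy = self.state != "I"              # drives the wired-OR shared line
        if xact == "BusRd":
            if self.state in ("M", "E"):          # M flushes; E downgrades
                self.state = "S"
        else:                                     # BusRdX invalidates
            self.state = "I"
        return had_copy

class MESIBus:
    def __init__(self):
        self.caches = []

    def transact(self, requester, xact):
        shared_line = False
        for c in self.caches:
            if c is not requester:
                shared_line |= c.snoop(xact)
        return shared_line

bus = MESIBus()
p1, p2, p3 = (MESICache(n, bus) for n in ("P1", "P2", "P3"))
bus.caches = [p1, p2, p3]

trace = []
for who, op in (("P1 reads", p1.read), ("P3 reads", p3.read),
                ("P3 writes", p3.write), ("P1 reads", p1.read),
                ("P2 reads", p2.read)):
    op()
    trace.append((who, p1.state, p2.state, p3.state))
for row in trace:
    print(row)
# ('P1 reads', 'E', 'I', 'I')
# ('P3 reads', 'S', 'I', 'S')
# ('P3 writes', 'I', 'I', 'M')
# ('P1 reads', 'S', 'I', 'S')
# ('P2 reads', 'S', 'S', 'S')
```

The first read lands in E because no other cache asserts the shared line; had P1 written next, no bus transaction would have been needed, which is exactly what MESI adds over MSI.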
Update-based Protocols
A write operation updates values in other caches
o New bus transaction: update
Advantages
o Other processors don't miss on their next access: reduced latency
  – In invalidation protocols, they would miss and cause more transactions
o A single bus transaction to update several caches can save bandwidth
  – Also, only the word written is transferred, not the whole block
Disadvantages
o Multiple writes by the same processor cause multiple update transactions
  – In invalidation, the first write gets exclusive ownership and the others are local
Detailed tradeoffs are more complex
Dragon Write-back Update Protocol
4 states
o Exclusive-clean or exclusive (E): I have it, and so does memory
o Shared clean (Sc): I have it, others may too, and maybe memory, but I'm not the owner
o Shared modified (Sm): I have it, others do too, memory does not, and I'm the owner
  – Sm and Sc can coexist in different caches, with only one Sm
o Modified or dirty (M): I have it, and no one else does
No invalid state
o If a block is in the cache, it cannot be invalid
o If it is not present in the cache, it can be viewed as being in a not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
o Introduced to specify actions when the block is not present in the cache
New bus transaction: BusUpd
o Broadcasts the single word written on the bus; updates other relevant caches
Dragon State Transition Diagram
[State diagram (Dragon):
 From not-present: PrRdMiss/BusRd(S̄) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S̄) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
 E: PrRd/— stays E; PrWr/— -> M; BusRd/— -> Sc
 Sc: PrRd/— stays Sc; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(S̄) -> M; BusUpd/Update stays Sc
 Sm: PrRd/— stays Sm; PrWr/BusUpd(S) stays Sm; PrWr/BusUpd(S̄) -> M; BusRd/Flush stays Sm; BusUpd/Update -> Sc
 M: PrRd/— and PrWr/— stay M; BusRd/Flush -> Sm]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u         E    --   --   BusRd        Memory
P3 reads u         Sc   --   Sc   BusRd        Memory
P3 writes u        Sc   --   Sm   BusUpd       P3 proc
P1 reads u         Sc   --   Sm   null         --
P2 reads u         Sc   Sc   Sm   BusRd        P3 cache
Invalidate versus Update
Basic question of program behavior
o Is a block written by one processor read by others before it is rewritten?
Invalidation:
o Yes => readers will take a miss
o No => multiple writes without additional traffic
  – and it clears out copies that won't be used again
Update:
o Yes => readers will not miss if they had a copy previously
  – single bus transaction to update all copies
o No => multiple useless updates, even to dead copies
Update versus Invalidate
Much debate over the years: the tradeoff depends on sharing patterns
Intuition:
o If those that used a block continue to use it, and writes between uses are few, update should do better
  – e.g. producer-consumer pattern
o If those that use a block are unlikely to use it again, or there are many writes between reads, updates are not good
  – "pack rat" phenomenon, particularly bad under process migration
  – useless updates where only the last one will be used
Can construct scenarios where one or the other is much better
Invalidation protocols are much more popular (more later)
o Some systems provide both, or even a hybrid
o E.g. competitive: observe patterns at runtime and change protocol
Example:
Pattern 1: Repeat k times: P1 writes x, followed by a sequence of reads by processors 2 through P
Pattern 2: Repeat k times: P1 writes x M times, and then P2 reads the value
What is the relative cost of the update- and invalidation-based protocols, in terms of number of cache misses and bus traffic?
Assumptions:
o an invalidation/upgrade transaction consumes 6 (5+1) bytes
o an update takes 14 (6+8) bytes
o a regular cache miss takes 70 (6+64) bytes
o P=16, M=10, k=10
Pattern 2 with Update Scheme
k =    1      2        3        ...   10
w1     miss   update   update   ...   update
w1     hit    update   update   ...   update
...
w1     hit    update   update   ...   update
r2     miss   hit      hit      ...   hit

traffic = 2 misses x 70 + (k-1) x M updates x 14
Pattern 2 with MESI Invalidation Scheme
k =    1      2       3       ...   10
w1     miss   inval   inval   ...   inval
w1     hit    hit     hit     ...   hit
...
w1     hit    hit     hit     ...   hit
r2     miss   miss    miss    ...   miss

traffic = 11 misses x 70 + (k-1) upgrades x 6
Pattern 1:
With an update scheme:
  Number of misses = P = 16; #updates = k-1 = 9
  Traffic = P x RdMiss + (k-1) x Update = 16x70 + 9x14 = 1,246 bytes
With an invalidate scheme:
  Number of misses = P + (k-1) x (P-1) = 16 + 9x15 = 151, plus k-1 = 9 upgrades
  Traffic = read misses x RdMiss + (k-1) x Upgrade = 151x70 + 9x6 = 10,624 bytes
Pattern 2:
With an update scheme:
  Number of misses = 2
  Traffic = 2 x RdMiss + M x (k-1) x Update = 2x70 + 10x9x14 = 1,400 bytes
With an invalidate scheme:
  Number of misses = 2 + (k-1) = 11
  Traffic = misses x RdMiss + (k-1) x Upgrade = 11x70 + 9x6 = 824 bytes
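The traffic arithmetic can be checked quickly under the stated byte costs:

```python
# Traffic for the two sharing patterns, using the byte costs from the slide.
UPGRADE = 6    # invalidation/upgrade transaction: 5 + 1 bytes
UPDATE  = 14   # update: 6 + 8-byte word
MISS    = 70   # regular cache miss: 6 + 64-byte block
P, M, k = 16, 10, 10

# Pattern 1: P1 writes x, then processors 2..P read it; repeated k times.
upd1 = P * MISS + (k - 1) * UPDATE                         # cold misses, then 1 update/iter
inv1 = (P + (k - 1) * (P - 1)) * MISS + (k - 1) * UPGRADE  # readers re-miss every iteration
print(upd1, inv1)   # -> 1246 10624

# Pattern 2: P1 writes x M times, then P2 reads it; repeated k times.
upd2 = 2 * MISS + (k - 1) * M * UPDATE                     # every later write updates P2's copy
inv2 = (2 + (k - 1)) * MISS + (k - 1) * UPGRADE            # one upgrade + one read miss/iter
print(upd2, inv2)   # -> 1400 824
```

Update wins by roughly 8.5x on Pattern 1 (one write, many readers per iteration); invalidate wins on Pattern 2 (many writes per read).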
Classification of Cache Misses
Multiprocessors add a new kind of miss to cold, capacity, and conflict
o Cold misses (compulsory misses) occur on the first reference to a memory block by a processor
o Capacity misses occur when all the blocks referenced by a processor during the execution of a program do not fit in the cache
o Conflict misses occur in caches with less than full associativity
o Coherence misses: true sharing and false sharing
  – True sharing occurs when a word produced by one processor is used by another. It is inherent to a given parallel decomposition and assignment
  – False sharing occurs when independent data words accessed by different processors happen to be placed in the same block, and at least one of the accesses is a write. False sharing misses are an example of artifactual communication.
Example:
Assume that words x1 and x2 are in the same cache block, in a clean state in the caches of P1 and P2, which have previously read both x1 and x2. Given the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated as a true sharing miss.

Time   P1         P2         Classification
1      Write x1              True sharing miss
2                 Read x2    False sharing miss
3      Write x1              False sharing miss
4                 Write x2   False sharing miss
5      Read x2               True sharing miss
A Classification of Cache Misses
o Many mixed categories, because a miss may have multiple causes

[Figure: miss-classification decision tree. Branch tests: first reference to the memory block by this processor? first access systemwide? written before? modified word(s) accessed during the block's lifetime? reason for elimination of the last copy (replacement vs. invalidation)? old copy with state = invalid still there? has the block been modified since replacement? Leaf categories: cold (first access systemwide or first by this processor), false-sharing-cold, true-sharing-cold, false-sharing-inval-cap, true-sharing-inval-cap, pure-false-sharing, pure-true-sharing, pure-capacity, true-sharing-capacity, false-sharing-cap-inval, true-sharing-cap-inval]
Impact of Cache Block Size
Reducing misses architecturally in invalidation protocol
o Capacity: enlarge cache; increase block size (if spatial locality)
o Conflict: increase associativity
o Cold and Coherence: only block size
Increasing block size has advantages and disadvantages
o Can reduce misses if spatial locality is good
o Can hurt too
– increase misses due to false sharing if spatial locality not good
– increase misses due to conflicts in fixed-size cache
– increase traffic due to fetching unnecessary data and due to false sharing
– can increase miss penalty and perhaps hit cost
Performance Measurement: An Example
AlphaServer 4100
o 4 Alpha 21164 processors, 300MHz
o Each processor has a 3-level cache hierarchy
o L1: 8K direct-mapped I-cache, 8K direct-mapped D-cache, 32-byte blocks; the D-cache is write-through, using a write buffer
o L2: 96K on-chip unified 3-way set-associative cache, 32-byte blocks, write back
o L3: off-chip, combined direct-mapped cache, 2M bytes, 64-byte blocks, write back
o Latency of an access
  – 7 cycles to L2
  – 21 cycles to L3
  – 80 cycles to main memory
  – a cache-to-cache transfer takes 125 cycles
Execution Breakdown
Benchmarks:
o OLTP: online transaction processing, CPI=7.0
o DSS: a decision support system, based on TPC-D, CPI=1.6
o AV: a Web index search (AltaVista), CPI=1.3
Very poor performance in OLTP, because of large numbers of expensive L3 misses
Cache Miss Sources
Contributing causes of memory access cycles shift as the cache size is increased
o L3 cache: 2-way set-associative
Two large sources shrink
o Instruction and capacity/conflict misses
But cold, false sharing, and true sharing misses are unaffected
True sharing becomes a dominant factor with a large cache
Increasing cache size eliminates most of the uniprocessor misses, while leaving the multiprocessor misses untouched
Impact of True Sharing
As the number of processors increases, the true sharing miss rate increases, and this is not compensated for by any decrease in the uniprocessor misses
Overall, the number of memory access cycles per instruction increases
Impact of Block Size
True sharing miss rate decreases by more than a factor of 2, indicating locality in the true sharing patterns
Cold start miss rate is significantly reduced
Conflict/capacity misses show a small decrease, indicating that spatial locality is not high in the uniprocessor misses
False sharing miss rate nearly doubles, although it is small in absolute terms
Exploiting Spatial Locality
Besides capacity, granularities are important:
o Granularity of allocation
o Granularity of communication or data transfer
o Granularity of coherence
Major spatial-related causes of artifactual communication:
o Conflict misses
o Data distribution/layout (allocation granularity)
o Fragmentation (communication granularity)
o False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures
o Fix problems by modifying data structures, or layout/alignment
Spatial Locality Example
Repeated sweeps over 2-d grid, each time adding 1 to elements
Natural 2-d versus higher-dimensional array representation
[Figure: a 2-d grid partitioned among processors P0..P8, with contiguity in memory layout shown. (a) Two-dimensional array: a cache block straddles a partition boundary, and a page straddles partition boundaries, making it difficult to distribute memory well. (b) Four-dimensional array: each cache block is within a partition, and pages do not straddle partition boundaries]
Example Performance Impact
Equation solver on SGI Origin2000
[Figure: two plots of speedup vs. number of processors (1 to 31) for the equation solver, comparing 2D, 4D, and Rows array layouts, each with and without round-robin ("rr") page placement]
Summary
Cache organization
o Shared cache vs. distributed caches
Snoopy cache coherence protocols
o MESI protocol
o Dragon protocol
Cache misses
o True sharing vs. false sharing
Architectural implications
o Exploit locality to reduce artifactual communication