Symmetric Multiprocessor (SMP)
ECE 7660
© C. Xu 1999-2005
Shared Memory Multiprocessors
Symmetric Multiprocessors (SMPs)
o Symmetric access to all of main memory from any processor
Dominate the server market
o Building blocks for larger systems; arriving at the desktop
Attractive as servers and for parallel programs
o Fine-grain resource sharing
o Uniform access via loads/stores
o Automatic data movement and coherent replication in caches
o Useful for the operating system too
Normal uniprocessor mechanisms to access data (reads and writes)
o Key is extending the memory hierarchy to support multiple processors
Supporting Programming Models
o Address translation and protection in hardware (hardware SAS)
o Message passing using shared memory buffers
– can be very high performance since no OS involvement necessary
o Focus here on supporting coherent shared address space
[Figure: layers of abstraction. Programming models (multiprogramming, shared address space, message passing) sit above the communication abstraction at the user/system boundary; compilation or library and operating systems support sit above the hardware/software boundary; communication hardware sits above the physical communication medium]
Natural Extensions of Memory System
[Figure: four organizations. (a) Shared cache: processors connect through a switch to a shared first-level cache and interleaved main memory. (b) Bus-based shared memory: private caches on a shared bus with memory and I/O devices. (c) Dancehall: private caches and memories on opposite sides of an interconnection network. (d) Distributed-memory: memory attached to each processor node, nodes connected by an interconnection network]
Caches and Cache Coherence
Caches play a key role in all cases
o Reduce average data access time
o Reduce bandwidth demands placed on the shared interconnect
But private processor caches create a problem
o Copies of a variable can be present in multiple caches
o A write by one processor may not become visible to others
  – They'll keep accessing the stale value in their caches
o This is the cache coherence problem
o Need to take actions to ensure visibility
Shared Cache
Advantages
o Sharing of working sets
o Low-latency sharing and prefetching across processors
o No coherence problem
Disadvantages
o High bandwidth needs and negative interference (e.g. conflicts)
o Latency increased due to the intervening switch and cache size
o Mid 80s: a couple of processors on a board (Encore, Sequent)
o Today: multiprocessors on a chip
A Coherent Memory System
Intuitive behavior model of a memory system:
Reading a location should return the latest value
written to that location
We rely on this model to reason about the correctness of a sequential program
Example of Cache Coherence Problem
o Processors see different values for u after event 3
o With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
  – Processes accessing main memory may see a very stale value
o Unacceptable to programs, and frequent!
[Figure: P1, P2, and P3 with private caches on a bus to memory and I/O devices; memory initially holds u:5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u=7, (4) P1 reads u, (5) P2 reads u]
Cache Coherence in Uniprocessors for I/O
The intuitive behavior model becomes invalid in uniprocessors for I/O
o Coherence between I/O devices and processors
o But it is infrequent, so software solutions work
  – uncacheable memory, uncacheable operations, flush pages, pass I/O data through caches
But the coherence problem is much more critical in multiprocessors
o Pervasive
o Performance-critical
o Must be treated as a basic hardware design issue
Problems with the Intuition
Recall: the value returned by a read should be the latest value written. But "latest" is not well-defined
In the sequential case, it is defined in terms of program order, not time
o Order of operations in the machine language presented to the processor
o "Subsequent" is defined in an analogous way, and is well defined
In the parallel case, program order is defined within a process, but we need to make sense of orders across processes
Must define a meaningful semantics
Extension to Multiprocessors
Memory operation
o A single read, write, or read-modify-write to a memory location, assumed to execute atomically w.r.t. each other
Issue event: occurs when an operation leaves a processor's internal environment and is presented to the memory system (cache, buffer, ...)
Perform event: the operation appears to have been performed, as far as a processor can tell from the other memory operations it issues
o A write appears to have been performed when a subsequent read by any processor returns the value of that write or a later write
o A read appears to have been performed when subsequent writes issued by any processor cannot affect the value returned by the read
Complete: performed with respect to all processors
Sharpening the Intuition
– Consider a single shared memory and no caches: memory imposes a serial or total order on all operations to the same physical location
o Operations to the location from a given processor are in program order
o The order of operations to the location from different processors is some interleaving that preserves the individual program orders
– "Latest" now means most recent in a hypothetical serial order that maintains these properties
– For the serial order to be consistent, all processors must "see" writes to the location in the same order
– And the program should behave as if some serial order is enforced
o The order in which things appear to happen, not actually happen
Formal Definition of Coherence
A memory system is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:
o operations issued by a process occur in the order in which they were issued to the memory system by that process, and
o the value returned by a read is the value written by the last write to that location in the serial order
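For tiny executions at a single location, this definition can be checked directly by brute force. The sketch below (hypothetical helper name `coherent`; exponential search, illustration only) tries every interleaving that respects each process's issue order and accepts if some interleaving makes every read return the last write before it:

```python
# Brute-force check of the coherence definition for one location (sketch).
from itertools import permutations

def coherent(histories):
    """histories: per-process lists of ('w', value_written) or ('r', value_returned)."""
    ops = [(p, i) for p, h in enumerate(histories) for i in range(len(h))]
    for order in permutations(ops):
        # Program order: each process's operations must appear in issue order.
        if any(order.index((p, i)) > order.index((p, i + 1))
               for p, h in enumerate(histories) for i in range(len(h) - 1)):
            continue
        last, ok = None, True          # value of the last write so far
        for p, i in order:
            kind, val = histories[p][i]
            if kind == 'w':
                last = val
            elif val != last:          # a read must return the last write
                ok = False
                break
        if ok:
            return True                # found a valid hypothetical serial order
    return False

# P0 writes 1 then 2; P1 reads 2 then 1: no serial order exists -> incoherent
print(coherent([[('w', 1), ('w', 2)], [('r', 2), ('r', 1)]]))   # -> False
print(coherent([[('w', 1), ('w', 2)], [('r', 1), ('r', 2)]]))   # -> True
```

The first history is incoherent because P1's second read would require a write of 1 after the write of 2, violating P0's program order.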
Two Necessary Features:
Write propagation:
o A value written must become visible to others
Write serialization:
o Writes to a location are seen in the same order by all
  – if one processor sees w1 before w2, another processor must not see w2 before w1
  – no need for analogous read serialization, since reads are not visible to others
Cache Coherence Using a Bus
Built on top of two fundamentals of uniprocessor systems
o Bus transactions
o State transition diagram in the cache
Uniprocessor bus transaction:
o Three phases: arbitration, command/address, data transfer
o All devices observe addresses; one is responsible
Uniprocessor cache states:
o Effectively, every block is a finite state machine
o Write-through, write-no-allocate has two states: valid, invalid
o Writeback caches have one more state: modified ("dirty")
Multiprocessors extend both of these somewhat to implement coherence
Snooping-based Coherence
Basic Idea
o Transactions on the bus are visible to all processors
o Processors or their representatives (cache controllers) can snoop the bus and take action on relevant events (e.g. change state)
Implementing a Protocol
o The cache controller now receives inputs from both sides:
  – Requests from the processor, bus requests/responses from the snooper
o In either case, it takes zero or more actions
  – Updates state, responds with data, generates new bus transactions
o The protocol is a distributed algorithm: cooperating state machines
  – Set of states, state transition diagram, actions
o Granularity of coherence is typically a cache block
Coherence with Write-through Caches
Key extensions to uniprocessor: snooping, invalidating/updating caches
o no new states or bus transactions in this case
o invalidation- versus update-based protocols
Write propagation: even in inval case, later reads will see new value
o inval causes miss on later access, and memory up-to-date via write-through
[Figure: P1 ... Pn with private caches on a bus to memory and I/O devices; each cache controller snoops the bus, observing other caches' cache-memory transactions]
Write-through State Transition Diagram
o A write invalidates all other caches (no local change of state)
  – can have multiple simultaneous readers of a block, but a write invalidates them

[State diagram: two states, V (Valid) and I (Invalid)
 Processor-initiated transitions: I -(PrRd/BusRd)-> V; V -(PrRd/—)-> V; V -(PrWr/BusWr)-> V; I -(PrWr/BusWr)-> I
 Bus-snooper-initiated transition: V -(BusWr/—)-> I]
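The two-state protocol above can be sketched as a toy single-block simulator (hypothetical class and method names; write-through, write-no-allocate, invalidation-based):

```python
# Toy snooping simulator for the write-through invalidation protocol:
# each cache holds one block in state V (valid) or I (invalid).

class WTCache:
    def __init__(self):
        self.state, self.data = "I", None

    def pr_rd(self, bus):
        if self.state == "I":          # read miss: BusRd fetches from memory
            self.data = bus.bus_rd()
            self.state = "V"
        return self.data

    def pr_wr(self, bus, value):
        bus.bus_wr(self, value)        # every write goes to the bus (write-through)
        if self.state == "V":          # write-no-allocate: stay I on a write miss
            self.data = value

class Bus:
    def __init__(self):
        self.memory, self.caches = 0, []

    def bus_rd(self):
        return self.memory

    def bus_wr(self, writer, value):
        self.memory = value            # memory updated on every write
        for c in self.caches:          # snoopers invalidate their copies
            if c is not writer:
                c.state = "I"

bus = Bus()
p1, p2 = WTCache(), WTCache()
bus.caches = [p1, p2]

p1.pr_rd(bus)            # P1: I -> V
p2.pr_rd(bus)            # P2: I -> V
p1.pr_wr(bus, 7)         # P1 writes 7: BusWr invalidates P2's copy
print(p2.state)          # -> I (stale copy invalidated)
print(p2.pr_rd(bus))     # -> 7 (P2 re-reads: miss, gets new value from memory)
```

Write propagation holds because memory is updated on every write and the invalidation forces the next read to miss.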
Is it Coherent?
To show: for any execution under the protocol, a total order of memory operations can be constructed that satisfies program order and write serialization.
Assume both bus transactions and memory operations are atomic
o With a one-level cache, assume invalidations are applied during the bus transaction
o These assumptions can be relaxed in more complex systems
Under the protocol, all writes go to the bus
o Writes are serialized by the order in which they appear on the bus (bus order)
o Per the above assumptions, invalidations are applied to caches in bus order
How to insert reads in this order?
o Processors see writes through reads
o But reads to a location are not completely serialized, due to read hits
Is it Coherent? (cont'd): Ordering Reads
– Read misses: appear on the bus, and will see the last write in bus order
– Read hits: do not appear on the bus
o But the value read was placed in the cache by either
  – the most recent write by this processor, or
  – the most recent read miss by this processor
o Both of these transactions appear on the bus
o So read hits also see values as being produced in consistent bus order
Problem with Write-Through
High bandwidth requirements
o Every write from every processor goes to the shared bus and memory
o Consider a 200MHz, 1 CPI processor, where 15% of instructions are 8-byte stores
o Each processor generates 30M stores, or 240MB of data, per second
o How many processors can a 1GB/s bus support without saturating?
  – bytes to the bus per second = 200M x 0.15 x 8 = 240MB
  – processors supported by the bus = 1GB / 240MB ≈ 4
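The arithmetic above, as a quick check (integer arithmetic, per-processor figures from the slide):

```python
# Bandwidth demand of write-through caches.
clock = 200_000_000                      # 200 MHz at 1 CPI -> 200M instructions/s
stores_per_sec = clock * 15 // 100       # 15% of instructions are stores
bytes_per_sec = stores_per_sec * 8       # 8 bytes per store: per-processor traffic

print(stores_per_sec)                    # -> 30000000  (30M stores/s)
print(bytes_per_sec)                     # -> 240000000 (240 MB/s)

bus_bw = 1_000_000_000                   # 1 GB/s shared bus
print(bus_bw // bytes_per_sec)           # -> 4 processors before saturation
```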
Write-back caches absorb most writes as cache hits
o Write hits don't go on the bus
o But now how do we ensure write propagation and serialization?
o Need more sophisticated protocols: large design space
Snooping Protocols for Write-Back Cache
No need to change the processor, main memory, or cache
o Extend the cache controller and exploit the bus (provides serialization)
Dirty state now also indicates exclusive ownership
o Exclusive: the only cache with a valid copy (main memory may have one too)
o Owner: responsible for supplying the block upon a request for it
Design space
o Invalidation- versus update-based protocols
o Set of states
Invalidation-based Protocols
Exclusive means the block can be modified without notifying anyone else
o i.e. without a bus transaction
o Must first get the block in exclusive state before writing into it
o Even if already in valid state, a transaction is needed, so this is called a write miss
A store to non-dirty data generates a read-exclusive bus transaction
o Tells others about the impending write and obtains exclusive ownership
  – makes the write visible, i.e. the write is performed
  – it may actually be observed (by a read miss) only later
  – a write hit is made visible (performed) when the block is updated in the writer's cache
o Only one RdX can succeed at a time for a block: serialized by the bus
Read and read-exclusive bus transactions drive coherence actions
o Writeback transactions do too, but they are not caused by a memory operation and are quite incidental to the coherence protocol
  – note: a replaced block that is not in modified state can be dropped
Basic MSI Writeback Inval Protocol
States
o Invalid (I)
o Shared (S): one or more copies
o Dirty or Modified (M): one copy only
Processor events
o PrRd (read)
o PrWr (write)
Bus transactions
o BusRd: asks for a copy with no intent to modify
o BusRdX: asks for a copy with intent to modify
o BusWB: updates memory
Actions
o Update state, perform bus transaction, flush value onto bus
State Transition Diagram
o Write to a shared block (a write miss)
  – Already have the latest data; can use an upgrade (BusUpgr) instead of BusRdX to reduce data traffic in the bus protocol
  – BusUpgr obtains exclusive ownership like a BusRdX, but it doesn't cause memory to respond with the data
o Replacement changes the state of two blocks:
  – The outgoing block changes its state to Invalid
  – The incoming block changes its state from Invalid to a new state
[State diagram (MSI):
 I -(PrRd/BusRd)-> S; I -(PrWr/BusRdX)-> M
 S -(PrRd/—)-> S; S -(PrWr/BusRdX)-> M; S -(BusRd/—)-> S; S -(BusRdX/—)-> I
 M -(PrRd/—)-> M; M -(PrWr/—)-> M; M -(BusRd/Flush)-> S; M -(BusRdX/Flush)-> I]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u         S    --   --   BusRd        Memory
P3 reads u         S    --   S    BusRd        Memory
P3 writes u        I    --   M    BusRdX       P3
P1 reads u         S    --   S    BusRd        P3 cache
P2 reads u         S    S    S    BusRd        Memory *

* "P1 reads u" flushes the P3 cache and updates the memory copy
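As a sketch, the trace above can be replayed on a toy single-block MSI simulator (hypothetical class names; in this simplified version the BusRdX data comes from memory, since no cache holds the block modified at that point):

```python
# Toy MSI snooping simulator for a single block.

class MSICache:
    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, "I"

    def read(self):
        if self.state == "I":              # read miss: BusRd, enter Shared
            self.bus.transact(self, "BusRd")
            self.state = "S"

    def write(self):
        if self.state != "M":              # need exclusive ownership first
            self.bus.transact(self, "BusRdX")
            self.state = "M"

    def snoop(self, xact):
        """Snooper side: react to another cache's bus transaction."""
        if self.state == "M":              # owner flushes the dirty block
            self.state = "S" if xact == "BusRd" else "I"
            return self.name + " cache"    # cache-to-cache supply; memory updated too
        if xact == "BusRdX":
            self.state = "I"               # invalidate a shared copy
        return None

class MSIBus:
    def __init__(self):
        self.caches, self.log = [], []

    def transact(self, requester, xact):
        supplier = "Memory"
        for c in self.caches:
            if c is not requester:
                supplier = c.snoop(xact) or supplier
        self.log.append((requester.name, xact, supplier))

bus = MSIBus()
p1, p2, p3 = (MSICache(n, bus) for n in ("P1", "P2", "P3"))
bus.caches = [p1, p2, p3]

for op in (p1.read, p3.read, p3.write, p1.read, p2.read):
    op()
for entry in bus.log:
    print(entry)   # bus-action and supplier columns of the trace
```

The log reproduces the Bus Action column, and the final states of P1, P2, P3 are all S, as in the last row of the table.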
Satisfying Coherence
Write propagation is clear: via the bus read-exclusive transaction
Write serialization?
o All writes that appear on the bus (BusRdX) are ordered by the bus
  – A write is performed in the writer's cache before it handles other transactions, so it is ordered the same way even w.r.t. the writer
o Reads that appear on the bus are ordered w.r.t. these. What about writes that don't appear on the bus?
  – a sequence of such writes between two bus transactions for the block must come from the same processor, say P
  – in the serialization, the sequence appears between these two bus transactions
  – reads by P will see them in this order w.r.t. other bus transactions
  – reads by other processors are separated from the sequence by a bus transaction, which places them in the serialized order w.r.t. the writes
  – so reads by all processors see the writes in the same order
MESI (4-state) Invalidation Protocol
Problem with the MSI protocol
o Reading and then modifying data takes 2 bus transactions, even if no one is sharing
  – e.g. even in a sequential program
  – BusRd (I->S) followed by BusRdX or BusUpgr (S->M)
Add an exclusive state: write locally without a transaction, but block not modified
o Main memory is up to date, so the cache is not necessarily the owner
o States
  – invalid
  – exclusive or exclusive-clean (only this cache has a copy, but not modified)
  – shared (two or more caches may have copies)
  – modified (dirty)
o I -> E on PrRd if no one else has a copy
  – needs a "shared" signal on the bus: a wired-OR line asserted in response to BusRd
MESI State Transition Diagram
o An additional control signal S determines, on a BusRd, whether any other cache currently holds the block
  – BusRd(S): shared line S asserted on the BusRd transaction
  – BusRd(S̄): shared line S not asserted
  – Plain BusRd: don't care about S
o Flush′: if cache-to-cache sharing is used, only one cache flushes the data
  – C2C sharing adds complexity: memory must wait to learn that no cache will supply the data
[State diagram (MESI):
 I -(PrRd/BusRd(S̄))-> E; I -(PrRd/BusRd(S))-> S; I -(PrWr/BusRdX)-> M
 E -(PrRd/—)-> E; E -(PrWr/—)-> M; E -(BusRd/Flush)-> S; E -(BusRdX/Flush)-> I
 S -(PrRd/—)-> S; S -(PrWr/BusRdX)-> M; S -(BusRd/Flush′)-> S; S -(BusRdX/Flush′)-> I
 M -(PrRd/—)-> M; M -(PrWr/—)-> M; M -(BusRd/Flush)-> S; M -(BusRdX/Flush)-> I]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u
P3 reads u
P3 writes u
P1 reads u
P2 reads u
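One way to work through this trace is a toy single-block MESI simulator with an Exclusive state and a wired-OR shared line (hypothetical class names; cache-to-cache supply and Flush′ details omitted):

```python
# Toy MESI simulator for a single block: MSI plus an Exclusive state,
# using a wired-OR "shared" line sampled on BusRd.

class MESICache:
    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, "I"

    def read(self):
        if self.state == "I":
            shared = self.bus.transact(self, "BusRd")
            self.state = "S" if shared else "E"   # E if no other cache has a copy

    def write(self):
        if self.state in ("E", "M"):
            self.state = "M"                      # silent upgrade: no bus transaction
        else:
            self.bus.transact(self, "BusRdX")
            self.state = "M"

    def snoop(self, xact):
        had_copy = self.state != "I"              # drives the wired-OR shared line
        if xact == "BusRd":
            if self.state in ("M", "E"):          # M flushes; E downgrades
                self.state = "S"
        else:                                     # BusRdX invalidates
            self.state = "I"
        return had_copy

class MESIBus:
    def __init__(self):
        self.caches = []

    def transact(self, requester, xact):
        shared_line = False
        for c in self.caches:
            if c is not requester:
                shared_line |= c.snoop(xact)
        return shared_line

bus = MESIBus()
p1, p2, p3 = (MESICache(n, bus) for n in ("P1", "P2", "P3"))
bus.caches = [p1, p2, p3]

trace = []
for who, op in (("P1 reads", p1.read), ("P3 reads", p3.read),
                ("P3 writes", p3.write), ("P1 reads", p1.read),
                ("P2 reads", p2.read)):
    op()
    trace.append((who, p1.state, p2.state, p3.state))
for row in trace:
    print(row)
# ('P1 reads', 'E', 'I', 'I')
# ('P3 reads', 'S', 'I', 'S')
# ('P3 writes', 'I', 'I', 'M')
# ('P1 reads', 'S', 'I', 'S')
# ('P2 reads', 'S', 'S', 'S')
```

The first read lands in E because no other cache asserts the shared line; had P1 written next, no bus transaction would have been needed, which is exactly what MESI adds over MSI.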
Update-based Protocols
A write operation updates values in other caches
o New bus transaction: update
Advantages
o Other processors don't miss on their next access: reduced latency
  – In invalidation protocols, they would miss and cause more transactions
o A single bus transaction to update several caches can save bandwidth
  – Also, only the word written is transferred, not the whole block
Disadvantages
o Multiple writes by the same processor cause multiple update transactions
  – In invalidation, the first write gets exclusive ownership and the others are local
Detailed tradeoffs are more complex
Dragon Write-back Update Protocol
4 states
o Exclusive-clean or exclusive (E): I have it, and so does memory
o Shared clean (Sc): I have it, others may too, and maybe memory, but I'm not the owner
o Shared modified (Sm): I have it, others do too, memory does not, and I'm the owner
  – Sm and Sc can coexist in different caches, with only one Sm
o Modified or dirty (M): I have it, and no one else does
No invalid state
o If a block is in the cache, it cannot be invalid
o If it is not present in the cache, it can be viewed as being in a not-present or invalid state
New processor events: PrRdMiss, PrWrMiss
o Introduced to specify actions when the block is not present in the cache
New bus transaction: BusUpd
o Broadcasts the single word written on the bus; updates other relevant caches
Dragon State Transition Diagram
[State diagram (Dragon):
 From not-present: PrRdMiss/BusRd(S̄) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S̄) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
 E: PrRd/— stays E; PrWr/— -> M; BusRd/— -> Sc
 Sc: PrRd/— stays Sc; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(S̄) -> M; BusUpd/Update stays Sc
 Sm: PrRd/— stays Sm; PrWr/BusUpd(S) stays Sm; PrWr/BusUpd(S̄) -> M; BusRd/Flush stays Sm; BusUpd/Update -> Sc
 M: PrRd/— and PrWr/— stay M; BusRd/Flush -> Sm]
Processor Action   P1   P2   P3   Bus Action   Data Supplied By
P1 reads u         E    --   --   BusRd        Memory
P3 reads u         Sc   --   Sc   BusRd        Memory
P3 writes u        Sc   --   Sm   BusUpd       P3 proc
P1 reads u         Sc   --   Sm   null         --
P2 reads u         Sc   Sc   Sm   BusRd        P3 cache
Invalidate versus Update
Basic question of program behavior
o Is a block written by one processor read by others before it is rewritten?
Invalidation:
o Yes => readers will take a miss
o No => multiple writes without additional traffic
  – and it clears out copies that won't be used again
Update:
o Yes => readers will not miss if they had a copy previously
  – single bus transaction to update all copies
o No => multiple useless updates, even to dead copies
Update versus Invalidate
Much debate over the years: the tradeoff depends on sharing patterns
Intuition:
o If those that used a block continue to use it, and writes between uses are few, update should do better
  – e.g. producer-consumer pattern
o If those that use a block are unlikely to use it again, or there are many writes between reads, updates are not good
  – "pack rat" phenomenon, particularly bad under process migration
  – useless updates where only the last one will be used
Can construct scenarios where one or the other is much better
Invalidation protocols are much more popular (more later)
o Some systems provide both, or even a hybrid
o E.g. competitive: observe patterns at runtime and change protocol
Example:
Pattern 1: Repeat k times: P1 writes x, followed by a sequence of reads by processors 2 through P
Pattern 2: Repeat k times: P1 writes x M times, and then P2 reads the value
What is the relative cost of the update- and invalidation-based protocols, in terms of number of cache misses and bus traffic?
Assumptions:
o an invalidation/upgrade transaction consumes 6 (5+1) bytes
o an update takes 14 (6+8) bytes
o a regular cache miss takes 70 (6+64) bytes
o P=16, M=10, k=10
Pattern 2 with Update Scheme
k =    1      2        3        ...   10
w1     miss   update   update   ...   update
w1     hit    update   update   ...   update
...
w1     hit    update   update   ...   update
r2     miss   hit      hit      ...   hit

traffic = 2 misses x 70 + (k-1) x M updates x 14
Pattern 2 with MESI Invalidation Scheme
k =    1      2       3       ...   10
w1     miss   inval   inval   ...   inval
w1     hit    hit     hit     ...   hit
...
w1     hit    hit     hit     ...   hit
r2     miss   miss    miss    ...   miss

traffic = 11 misses x 70 + (k-1) upgrades x 6
Pattern 1:
With an update scheme:
  Number of misses = P = 16; #updates = k-1 = 9
  Traffic = P x RdMiss + (k-1) x Update = 16x70 + 9x14 = 1,246 bytes
With an invalidate scheme:
  Number of misses = P + (k-1) x (P-1) = 16 + 9x15 = 151, plus k-1 = 9 upgrades
  Traffic = read misses x RdMiss + (k-1) x Upgrade = 151x70 + 9x6 = 10,624 bytes
Pattern 2:
With an update scheme:
  Number of misses = 2
  Traffic = 2 x RdMiss + M x (k-1) x Update = 2x70 + 10x9x14 = 1,400 bytes
With an invalidate scheme:
  Number of misses = 2 + (k-1) = 11
  Traffic = misses x RdMiss + (k-1) x Upgrade = 11x70 + 9x6 = 824 bytes
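The traffic arithmetic can be checked quickly under the stated byte costs:

```python
# Traffic for the two sharing patterns, using the byte costs from the slide.
UPGRADE = 6    # invalidation/upgrade transaction: 5 + 1 bytes
UPDATE  = 14   # update: 6 + 8-byte word
MISS    = 70   # regular cache miss: 6 + 64-byte block
P, M, k = 16, 10, 10

# Pattern 1: P1 writes x, then processors 2..P read it; repeated k times.
upd1 = P * MISS + (k - 1) * UPDATE                         # cold misses, then 1 update/iter
inv1 = (P + (k - 1) * (P - 1)) * MISS + (k - 1) * UPGRADE  # readers re-miss every iteration
print(upd1, inv1)   # -> 1246 10624

# Pattern 2: P1 writes x M times, then P2 reads it; repeated k times.
upd2 = 2 * MISS + (k - 1) * M * UPDATE                     # every later write updates P2's copy
inv2 = (2 + (k - 1)) * MISS + (k - 1) * UPGRADE            # one upgrade + one read miss/iter
print(upd2, inv2)   # -> 1400 824
```

Update wins by roughly 8.5x on Pattern 1 (one write, many readers per iteration); invalidate wins on Pattern 2 (many writes per read).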
Classification of Cache Misses
Multiprocessors add a new kind of miss to cold, capacity, and conflict
o Cold misses (compulsory misses) occur on the first reference to a memory block by a processor
o Capacity misses occur when all the blocks referenced by a processor during the execution of a program do not fit in the cache
o Conflict misses occur in caches with less than full associativity
o Coherence misses: true sharing and false sharing
  – True sharing occurs when a word produced by one processor is used by another. It is inherent to a given parallel decomposition and assignment
  – False sharing occurs when independent data words accessed by different processors happen to be placed in the same block, and at least one of the accesses is a write. False sharing misses are an example of artifactual communication.
Example:
Assume that words x1 and x2 are in the same cache block, in a clean state in the caches of P1 and P2, which have previously read both x1 and x2. Given the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated as a true sharing miss.

Time   P1         P2         Classification
1      Write x1              True sharing miss
2                 Read x2    False sharing miss
3      Write x1              False sharing miss
4                 Write x2   False sharing miss
5      Read x2               True sharing miss
A Classification of Cache Misses
o Many mixed categories, because a miss may have multiple causes

[Figure: miss-classification decision tree. Branch tests: first reference to the memory block by this processor? first access systemwide? written before? modified word(s) accessed during the block's lifetime? reason for elimination of the last copy (replacement vs. invalidation)? old copy with state = invalid still there? has the block been modified since replacement? Leaf categories: cold (first access systemwide or first by this processor), false-sharing-cold, true-sharing-cold, false-sharing-inval-cap, true-sharing-inval-cap, pure-false-sharing, pure-true-sharing, pure-capacity, true-sharing-capacity, false-sharing-cap-inval, true-sharing-cap-inval]
Impact of Cache Block Size
Reducing misses architecturally in invalidation protocol
o Capacity: enlarge cache; increase block size (if spatial locality)
o Conflict: increase associativity
o Cold and Coherence: only block size
Increasing block size has advantages and disadvantages
o Can reduce misses if spatial locality is good
o Can hurt too
– increase misses due to false sharing if spatial locality not good
– increase misses due to conflicts in fixed-size cache
– increase traffic due to fetching unnecessary data and due to false sharing
– can increase miss penalty and perhaps hit cost
Performance Measurement: An Example
AlphaServer 4100
o 4 Alpha 21164 processors, 300MHz
o Each processor has a 3-level cache hierarchy
o L1: 8K direct-mapped I-cache, 8K direct-mapped D-cache, 32-byte blocks; the D-cache is write-through, using a write buffer
o L2: 96K on-chip unified 3-way set-associative cache, 32-byte blocks, write back
o L3: off-chip, combined direct-mapped cache, 2M bytes, 64-byte blocks, write back
o Latency of an access
  – 7 cycles to L2
  – 21 cycles to L3
  – 80 cycles to main memory
  – a cache-to-cache transfer takes 125 cycles
Execution Breakdown
Benchmarks:
o OLTP: online transaction processing, CPI=7.0
o DSS: a decision support system, based on TPC-D, CPI=1.6
o AV: a Web index search (AltaVista), CPI=1.3
Very poor performance in OLTP, because of large numbers of expensive L3 misses
Cache Miss Sources
Contributing causes of memory access cycles shift as the cache size is increased
o L3 cache: 2-way set-associative
Two large sources shrink
o Instruction and capacity/conflict misses
But cold, false sharing, and true sharing misses are unaffected
True sharing becomes a dominant factor with a large cache
Increasing cache size eliminates most of the uniprocessor misses, while leaving the multiprocessor misses untouched
Impact of True Sharing
As the number of processors increases, the true sharing miss rate increases, and this is not compensated for by any decrease in the uniprocessor misses
Overall, the number of memory access cycles per instruction increases
Impact of Block Size
True sharing miss rate decreases by more than a factor of 2, indicating locality in the true sharing patterns
Cold start miss rate is significantly reduced
Conflict/capacity misses show a small decrease, indicating that spatial locality is not high in the uniprocessor misses
False sharing miss rate nearly doubles, although it is small in absolute terms
Exploiting Spatial Locality
Besides capacity, granularities are important:
o Granularity of allocation
o Granularity of communication or data transfer
o Granularity of coherence
Major spatial-related causes of artifactual communication:
o Conflict misses
o Data distribution/layout (allocation granularity)
o Fragmentation (communication granularity)
o False sharing of data (coherence granularity)
All depend on how spatial access patterns interact with data structures
o Fix problems by modifying data structures, or layout/alignment
Spatial Locality Example
Repeated sweeps over 2-d grid, each time adding 1 to elements
Natural 2-d versus higher-dimensional array representation
[Figure: a 2-d grid partitioned among processors P0..P8, with contiguity in memory layout shown. (a) Two-dimensional array: a cache block straddles a partition boundary, and a page straddles partition boundaries, making it difficult to distribute memory well. (b) Four-dimensional array: each cache block is within a partition, and pages do not straddle partition boundaries]
Example Performance Impact
Equation solver on SGI Origin2000
[Figure: two plots of speedup vs. number of processors (1 to 31) for the equation solver, comparing 2D, 4D, and Rows array layouts, each with and without round-robin ("rr") page placement]
Summary
Cache organization
o Shared cache vs. distributed caches
Snoopy cache coherence protocols
o MESI protocol
o Dragon protocol
Cache misses
o True sharing vs. false sharing
Architectural implications
o Exploit locality to reduce artifactual communication