
  • Overview: Shared Memory Hardware

    - overview of shared address space systems
    - example: cache hierarchy of the Intel Core i7
    - cache coherency protocols: basic ideas, invalidate and update protocols
    - false sharing
    - MSI protocol implementation
    - snoopy cache-based systems
    - directory cache-based systems, and their cache coherency issues
    - cache coherency protocols in practice
    - HPC study article: 12 Ways to Fool the Masses: Fast Forward to 2011

    Refs:
    - Lin & Snyder Ch 2; Grama et al Ch 2
    - SGI Origin architecture, AMD Northbridge architecture, Intel QuickPath technology

    News: Giant Chip for Deep Learning (400K cores, 18 GB SRAM, 10 TB/s internal network)


    http://cs.anu.edu.au/courses/comp4300/refs/dhb-12ways.pdf
    http://cs.anu.edu.au/courses/comp4300/refs/SGIisca.pdf
    http://cs.anu.edu.au/courses/comp4300/refs/AMDNorthbridge.pdf
    http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf
    https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier

  • Shared Memory Hardware

    (Fig 2.5 Grama et al, Intro to Parallel Computing)


  • Shared Address Space Systems

    - systems with caches but otherwise flat memory are generally called UMA
    - if access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
      - how to do this, and what O/S support exists, is another matter
      - man numa will give details of NUMA support
    - a global address space is considered easier to program
      - read-only interactions are invisible to the programmer and can be coded like a sequential program
      - read/write interactions are harder: they require mutual exclusion for concurrent accesses
    - the main programming models are threads and directive-based (we will use Pthreads and OpenMP)
    - synchronization uses locks and related mechanisms (see the sketch below)
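    A minimal sketch (our own, assuming a POSIX system; not from the course notes) of shared-address-space programming with Pthreads: two threads increment a shared counter, with a mutex providing the mutual exclusion required for concurrent read/write accesses.

      #include <pthread.h>
      #include <stdio.h>

      static long counter = 0;    /* shared: visible to all threads */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg) {
          for (int i = 0; i < 1000000; i++) {
              pthread_mutex_lock(&lock);     /* serialize the read-modify-write */
              counter++;
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, NULL);
          pthread_create(&t2, NULL, worker, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
          return 0;
      }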


  • Shared Address Space and Shared Memory Computers

    - shared memory was historically used for architectures in which memory is physically shared among various processors, and all processors have equal access to any memory segment
      - this is identical to the UMA model
      - the term SMP originally meant Symmetric Multi-Processor: all CPUs had equal OS capabilities (interrupts, I/O & other system calls); it now means Shared Memory Processor (almost all are ‘symmetric’)
    - c.f. distributed-memory computers, where different memory segments are physically associated with different processing elements
    - either physical model can present the logical view of a disjoint or shared-address-space platform
      - a distributed-memory shared-address-space computer is a NUMA system


  • Cache Hierarchy on the Intel Core i7 (2013)

    (64-byte cache line size)
    Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1


  • Caches on Multiprocessors

    - issue: multiple copies of some data word may be manipulated by two or more processors at the same time
    - two requirements:
      - an address translation mechanism that locates each physical memory word in the system
      - concurrent operations on multiple copies must have well-defined semantics
    - the latter is generally known as a cache coherency protocol
      - input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
    - some machines only provide shared address space mechanisms and leave coherence to (system or user-level) software
      - e.g. the Texas Instruments Keystone II system, the Intel Single-chip Cloud Computer


    http://cs.anu.edu.au/courses/comp4300/refs/intel-scc-overview.pdf

  • Cache Coherency

    - intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
      - what does ‘last’ mean? What if two accesses are simultaneous, or closer in time than the time required to communicate between two processors?
    - in a sequential program, ‘last’ is determined by program order (not time)
      - this holds true within one thread of a parallel program, but what does it mean with multiple threads?


  • Cache/Memory Coherency

    - a memory system is coherent if:
      - Ordered as Issued: a read by processor P to address X that follows a write by P to X returns the value of that write (assuming no other processor writes to X in between)
      - Write Propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value, if the read and write are sufficiently separated in time (assuming no other write to X occurs in between)
      - Write Serialization: writes to the same address are serialized: two writes by any two processors are observed in the same order by all processors (see the sketch below)
    - (later to be contrasted with memory consistency!)
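    An illustrative sketch (our own, not from the slides) of what write serialization forbids: two writers race on x, and two readers each sample it twice. Coherence requires both readers to observe the two writes in the same order. A data race like this is only a thought experiment; real code would use atomics.

      #include <pthread.h>
      #include <stdio.h>

      static volatile int x = 0;

      static void *writer1(void *arg) { x = 1; return NULL; }
      static void *writer2(void *arg) { x = 2; return NULL; }

      static void *reader(void *arg) {
          int a = x, b = x;   /* two samples of the same address */
          /* if this reader sees 1 then 2, write serialization says the
             other reader must not see 2 then 1 */
          printf("saw %d then %d\n", a, b);
          return NULL;
      }

      int main(void) {
          pthread_t t[4];
          pthread_create(&t[0], NULL, writer1, NULL);
          pthread_create(&t[1], NULL, writer2, NULL);
          pthread_create(&t[2], NULL, reader, NULL);
          pthread_create(&t[3], NULL, reader, NULL);
          for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
          return 0;
      }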


  • Two Cache Coherency Protocols

    (Fig 2.21 Grama et al, Intro to Parallel Computing)


  • Cache Line View

    Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

    - need to augment each cache line with information regarding its validity


  • Update vs Invalidate

    - update protocol:
      - when a data item is written, all of its copies in the system are updated
    - invalidate protocol (most common):
      - before a data item is written, all other copies are marked as invalid
    - comparison (see the worked count below):
      - multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
      - with multi-word cache blocks, each word written in a cache block (line) must be broadcast in an update protocol, but only one invalidate per line is required
      - the delay between writing a word on one processor and reading the written value on another is usually less for the update protocol
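    As a worked count (our own example): a processor performing 10 consecutive writes to one word generates 10 broadcasts under an update protocol, but only 1 invalidation (on the first write) under an invalidate protocol; writes 2-10 then hit the now-exclusive local copy with no bus traffic.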


  • False Sharing

    - two processors modify different parts of the same cache line
    - the invalidate protocol leads to ping-ponged cache lines
    - the update protocol performs reads locally, but the updates generate much traffic between processors
    - this effect is entirely an artefact of the hardware
    - need to design parallel systems/programs with this issue in mind (see the sketch below):
      - cache line size: the longer the line, the more likely false sharing becomes
      - alignment of data structures with respect to cache line size

    http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
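    A minimal sketch (our own, not from the course notes) of false sharing: two threads increment adjacent words of the same 64-byte cache line, so the line ping-pongs between their caches under an invalidate protocol. The padded variant places each counter on its own line; timing the two versions shows the difference.

      #include <pthread.h>

      /* falsely shared: both counters live in one 64-byte line */
      static long counters[2];

      /* fix: pad each counter out to its own cache line */
      struct padded { long value; char pad[64 - sizeof(long)]; };
      static struct padded padded_counters[2];

      static void *bump(void *arg) {
          long id = (long)arg;
          for (long i = 0; i < 100000000; i++)
              counters[id]++;   /* swap in padded_counters[id].value++
                                   to eliminate the ping-ponging */
          return NULL;
      }

      int main(void) {
          pthread_t t0, t1;
          pthread_create(&t0, NULL, bump, (void *)0);
          pthread_create(&t1, NULL, bump, (void *)1);
          pthread_join(t0, NULL);
          pthread_join(t1, NULL);
          return 0;
      }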


  • Implementing Cache Coherency

    - on small-scale bus-based machines:
      - a processor must obtain access to the bus to broadcast a write invalidation
      - with two competing processors, the first to gain access to the bus will invalidate the other's data
    - a cache miss needs to locate the ‘top’ (most up-to-date) copy of the data
      - easy for a write-through cache
      - for a write-back cache, each processor's cache snoops the bus and responds if it has the top copy of the data
    - for writes, we would like to know if any other copies of the line (block) are cached
      - i.e. whether a write-back cache needs to put details on the bus
      - handled by having a bit in the line state to indicate shared status
    - minimizing processor stalls
      - by more detailed tracking of the line state, or by having (multi-level) inclusive caches


  • 3 State (MSI) Cache Coherency Protocol

    - read: local read
    - write: local write
    - c_read (coherency read): a read (miss) on a remote processor gives rise to the shown transition in the local cache
    - c_write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache (these transitions are sketched in code below)

    (Fig 2.22 Grama et al, Intro to Parallel Computing)
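    A minimal sketch (our own, following the shape of Fig 2.22; the state and event names are illustrative) of the MSI transitions for one cache line, written as a pure state function:

      typedef enum { INVALID, SHARED, MODIFIED } msi_state;
      typedef enum { READ, WRITE, C_READ, C_WRITE } msi_event;

      static msi_state msi_next(msi_state s, msi_event e) {
          switch (s) {
          case INVALID:
              if (e == READ)  return SHARED;     /* read miss: fetch line */
              if (e == WRITE) return MODIFIED;   /* write miss: invalidate
                                                    all other copies */
              return INVALID;                    /* remote traffic: no copy
                                                    held here */
          case SHARED:
              if (e == WRITE)   return MODIFIED; /* upgrade: broadcast an
                                                    invalidation */
              if (e == C_WRITE) return INVALID;  /* remote write: drop our
                                                    copy */
              return SHARED;                     /* local/remote reads keep
                                                    the line in S */
          case MODIFIED:
              if (e == C_READ)  return SHARED;   /* flush dirty data and
                                                    downgrade */
              if (e == C_WRITE) return INVALID;  /* flush, then invalidate */
              return MODIFIED;                   /* local accesses stay in M */
          }
          return s;
      }

    For example, if P0 writes a line (INVALID -> MODIFIED) and P1 then reads it, P1's miss appears at P0 as a c_read, so P0's copy moves MODIFIED -> SHARED and the dirty data is flushed.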


  • MSI Coherency Protocol

    (Fig 2.23 Grama et al, Intro to Parallel Computing)

    Discussion Point: review invalidate vs update protocols, both using the MSI protocol. Do the same for false sharing.


  • Snoopy Cache Systems

    - all caches broadcast all transactions (read or write misses, writes in the S state)
      - suited to (and easy to implement on) bus or ring interconnects
      - however, scalability is limited (i.e. ≤ 8 processors); what about torus on-chip networks? (assume wormhole routing)
    - all processors' caches monitor the bus (or interconnect port) for transactions of interest
    - each processor's cache has a set of tag bits that determine the state of the cache block
      - tags are updated according to the state diagram for the relevant protocol
      - e.g. when the snoop hardware detects that a read has been issued for a cache block of which it holds a dirty copy, it asserts control of the bus, puts the data out (to the requesting cache and to main memory), and sets the tag to the S state
    - what sort of data access characteristics are likely to perform well/badly on snoopy-based systems?


  • Snoopy Cache-Based System: Bus

    (Fig 2.24 Grama et al, Intro to Parallel Computing)


  • Snoopy Cache-Based System: Ring

    The Core i7 (Sandy Bridge) on-chip interconnect:

    - a ring-based interconnect between the Cores, Graphics, Last Level Cache (LLC) and System Agent domains
    - has 4 physical rings: Data (32 B), Request, Acknowledge and Snoop rings
    - fully pipelined; bandwidth, latency and power scale with the number of cores
    - the shortest path is chosen to minimize latency
    - has distributed arbitration & sophisticated protocols to handle coherency and ordering

    (courtesy www.lostcircuits.com)


  • Directory Cache-Based Systems

    - broadcasting is clearly not scalable
      - a solution is to send information only to the processing elements specifically interested in that data
    - this requires a directory to store the necessary information
      - augment global memory with a presence bitmap indicating which caches each memory block is located in


  • Directory-Based Cache Coherency

    (Fig 2.25 Grama et al, Intro to Parallel Computing)


  • Directory-Based Cache Coherency

    - a simple protocol might be (sketched in code below):
      - shared: one or more processors have the block cached, and the value in memory is up to date
      - uncached: no processor has a copy
      - exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
    - must handle a read/write miss and a write to a shared, clean cache block
      - these first reference the directory entry to determine the current state of the block
      - then update the entry's status and presence bitmap
      - then send the appropriate state-update transactions to the processors in the presence bitmap
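    A minimal sketch (our own; all names are illustrative, and send_invalidate() stands in for a real interconnect message) of a directory entry and the handling of a write to a shared block:

      #include <stdint.h>
      #include <stdio.h>

      #define NPROCS 64

      typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state;

      typedef struct {
          dir_state state;
          uint64_t  presence;  /* bit i set => processor i caches the block */
      } dir_entry;

      static void send_invalidate(int proc, uint64_t block) {
          /* stand-in for a real interconnect message */
          printf("invalidate block %llu in cache %d\n",
                 (unsigned long long)block, proc);
      }

      /* processor `writer` writes a block currently in the shared state:
         invalidate every other sharer, then record exclusive ownership */
      static void handle_shared_write(dir_entry *e, uint64_t block, int writer) {
          for (int p = 0; p < NPROCS; p++)
              if (((e->presence >> p) & 1) && p != writer)
                  send_invalidate(p, block);   /* only to actual sharers */
          e->presence = 1ULL << writer;        /* sole remaining copy */
          e->state    = DIR_EXCLUSIVE;         /* memory copy now stale */
      }

      int main(void) {
          dir_entry e = { DIR_SHARED, 0x7 };   /* shared by procs 0, 1, 2 */
          handle_shared_write(&e, 42, 0);      /* proc 0 writes block 42 */
          return 0;
      }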


  • Issues in Directory-Based Systems

    - how much memory is required to store the directory? (see the worked example below)
    - what sort of data access characteristics are likely to perform well/badly on directory-based systems?
      - how do distributed and centralized systems compare?
    - should the presence bitmaps be replicated in the caches? Must they be?
    - how would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?
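    A rough worked example (our own numbers, illustrative only): with a full presence bitmap, the directory needs p bits per memory block, so for p = 256 processors and 64-byte (512-bit) blocks the overhead is 256/512 = 50% of the memory itself. The per-block cost grows linearly in p, and the total cost as O(p²), since the amount of memory also scales with the number of processors.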


  • Costs on SGI Origin 3000 (clock cycles)

                                                  ≤ 16 CPUs   > 16 CPUs
    cache hit                                          1           1
    cache miss to local memory                        85          85
    cache miss to remote home directory              125         150
    cache miss to remotely cached data (3 hops)      140         170

    Figure from http://people.nas.nasa.gov/~schang/origin_opt.html
    Data from: Computer Architecture: A Quantitative Approach, by David A. Patterson, John L. Hennessy and David Goldberg, 3rd ed., Morgan Kaufmann, 2003


  • Real Cache Coherency Protocols

    - from Wikipedia:
      - Modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect. The MESI protocol adds an “Exclusive” state to reduce the traffic caused by writes of blocks that exist in only one cache. The MOSI protocol adds an “Owned” state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor that owns the cache line services requests for that data]. The MOESI protocol does both of these things. The MESIF protocol uses the “Forward” state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.
    - case study: coherency via the MOESI protocol in the Sun Fire V1280 NUMA SMP (2005)


    http://users.cecs.anu.edu.au/~peter/papers/memsim-talk.pdf

  • MESI Protocol (on a bus)

    (in ‘S → (PW) → E’ and ‘I → (PW/S) → E’, replace E with M)

    Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm


  • Multi-Level Caches

    - what is the visibility of changes between levels of cache?

    http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

    - the easiest model is inclusive:
      - if a line is in the owned state in L1, it is also in the owned state in L2


  • The Coherency Wall: Cache Coherency Considered Harmful

    - interconnects are expected to consume 50× more energy than logic circuits
    - standard protocols require a broadcast message for each invalidation
      - maintaining the (MOESI) protocol also requires a broadcast on every miss
      - the energy cost of each broadcast is O(p); with O(p) processors broadcasting, the overall cost is O(p²)!
      - broadcasts also cause contention (& delay) in the network (worse than O(p²)?)
    - directory-based protocols can direct invalidation messages to only the caches holding the same data
      - far more scalable for lightly-shared data
      - worse otherwise; they also introduce overhead through indirection
      - for each cached line, we need a bit vector of length p: an O(p²) storage cost
    - false sharing in any case results in wasted traffic
    - atomic instructions (essential for locks etc.) sync the memory system down to the LLC, at a cost of O(p) energy each!
    - cache line size is sub-optimal for messages on on-chip networks


  • Cache Coherency Summary

    - cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
    - three components to cache coherency:
      - issue order, write propagation, write serialization
    - two implementations:
      - broadcast/snoop: suitable for small-medium intra-chip and small inter-socket systems
      - directory-based: suitable for medium-large inter-socket systems
    - false sharing is a potential performance issue
      - more likely, the longer the cache line
    - energy considerations argue for no coherency on large intra-chip systems, e.g. the PEZY-SC
      - instead: OS-managed distributed shared memory, or message-passing programming models


    https://en.wikichip.org/wiki/pezy/pezy-sc#Architecture