
  • Overview: Shared Memory Hardware

    - overview of shared address space systems
    - example: cache hierarchy of the Intel Core i7
    - cache coherency protocols: basic ideas, invalidate and update protocols
    - false sharing
    - MSI protocol implementation
    - snoopy cache-based systems
    - directory cache-based systems, and their cache coherency issues
    - cache coherency protocols in practice
    - HPC study article: 12 Ways to Fool the Masses: Fast Forward to 2011

    Refs:
    - Lin & Snyder Ch 2; Grama et al Ch 2
    - SGI Origin architecture, AMD Northbridge architecture, Intel QuickPath technology

    News: Giant Chip for Deep Learning (400K cores, 18 GB SRAM, 10 TB/s internal network)


    http://cs.anu.edu.au/courses/comp4300/refs/dhb-12ways.pdf
    http://cs.anu.edu.au/courses/comp4300/refs/SGIisca.pdf
    http://cs.anu.edu.au/courses/comp4300/refs/AMDNorthbridge.pdf
    http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf
    https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier

  • Shared Memory Hardware

    (Fig 2.5 Grama et al, Intro to Parallel Computing)


  • Shared Address Space Systems

    - systems with caches but otherwise flat memory are generally called UMA
    - if access to local memory is cheaper than access to remote memory (NUMA), this should be built into your algorithm
      - how to do this, and what O/S support exists, is another matter
      - man numa will give details of NUMA support
    - a global address space is considered easier to program
      - read-only interactions are invisible to the programmer and can be coded like a sequential program
      - read/write interactions are harder: they require mutual exclusion for concurrent accesses
    - the main programming models are threads and directive-based (we will use Pthreads and OpenMP)
    - synchronization uses locks and related mechanisms (see the sketch below)
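    A minimal sketch (our own, assuming a POSIX system; not from the course notes) of shared-address-space programming with Pthreads: two threads increment a shared counter, with a mutex providing the mutual exclusion required for concurrent read/write accesses.

      #include <pthread.h>
      #include <stdio.h>

      static long counter = 0;    /* shared: visible to all threads */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg) {
          for (int i = 0; i < 1000000; i++) {
              pthread_mutex_lock(&lock);     /* serialize the read-modify-write */
              counter++;
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, worker, NULL);
          pthread_create(&t2, NULL, worker, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
          return 0;
      }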


  • Shared Address Space and Shared Memory Computers

    - shared memory was historically used for architectures in which memory is physically shared among various processors, and all processors have equal access to any memory segment
      - this is identical to the UMA model
      - the term SMP originally meant Symmetric Multi-Processor: all CPUs had equal OS capabilities (interrupts, I/O & other system calls); it now means Shared Memory Processor (almost all are ‘symmetric’)
    - c.f. distributed-memory computers, where different memory segments are physically associated with different processing elements
    - either physical model can present the logical view of a disjoint or shared-address-space platform
      - a distributed-memory shared-address-space computer is a NUMA system


  • Cache Hierarchy on the Intel Core i7 (2013)

    (64-byte cache line size)
    Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1


  • Caches on Multiprocessors

    - issue: multiple copies of some data word may be manipulated by two or more processors at the same time
    - two requirements:
      - an address translation mechanism that locates each physical memory word in the system
      - concurrent operations on multiple copies must have well-defined semantics
    - the latter is generally known as a cache coherency protocol
      - input/output using direct memory access (DMA) on machines with caches also leads to coherency issues
    - some machines only provide shared address space mechanisms and leave coherence to (system or user-level) software
      - e.g. the Texas Instruments Keystone II system, the Intel Single-chip Cloud Computer


    http://cs.anu.edu.au/courses/comp4300/refs/intel-scc-overview.pdf

  • Cache Coherency

    - intuitive behaviour: reading the value at address X should return the last value written to address X by any processor
      - what does ‘last’ mean? What if two accesses are simultaneous, or closer in time than the time required to communicate between two processors?
    - in a sequential program, ‘last’ is determined by program order (not time)
      - this holds true within one thread of a parallel program, but what does it mean with multiple threads?


  • Cache/Memory Coherency

    - a memory system is coherent if:
      - Ordered as Issued: a read by processor P to address X that follows a write by P to X returns the value of that write (assuming no other processor writes to X in between)
      - Write Propagation: a read by processor P1 to address X that follows a write by processor P2 to X returns the written value, if the read and write are sufficiently separated in time (assuming no other write to X occurs in between)
      - Write Serialization: writes to the same address are serialized: two writes by any two processors are observed in the same order by all processors (see the sketch below)
    - (later to be contrasted with memory consistency!)
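    An illustrative sketch (our own, not from the slides) of what write serialization forbids: two writers race on x, and two readers each sample it twice. Coherence requires both readers to observe the two writes in the same order. A data race like this is only a thought experiment; real code would use atomics.

      #include <pthread.h>
      #include <stdio.h>

      static volatile int x = 0;

      static void *writer1(void *arg) { x = 1; return NULL; }
      static void *writer2(void *arg) { x = 2; return NULL; }

      static void *reader(void *arg) {
          int a = x, b = x;   /* two samples of the same address */
          /* if this reader sees 1 then 2, write serialization says the
             other reader must not see 2 then 1 */
          printf("saw %d then %d\n", a, b);
          return NULL;
      }

      int main(void) {
          pthread_t t[4];
          pthread_create(&t[0], NULL, writer1, NULL);
          pthread_create(&t[1], NULL, writer2, NULL);
          pthread_create(&t[2], NULL, reader, NULL);
          pthread_create(&t[3], NULL, reader, NULL);
          for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
          return 0;
      }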


  • Two Cache Coherency Protocols

    (Fig 2.21 Grama et al, Intro to Parallel Computing)


  • Cache Line View

    Ref: http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

    - need to augment each cache line with information regarding its validity


  • Update vs Invalidate

    - update protocol:
      - when a data item is written, all of its copies in the system are updated
    - invalidate protocol (most common):
      - before a data item is written, all other copies are marked as invalid
    - comparison (see the worked count below):
      - multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation
      - with multi-word cache blocks, each word written in a cache block (line) must be broadcast in an update protocol, but only one invalidate per line is required
      - the delay between writing a word on one processor and reading the written value on another is usually less for the update protocol
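    As a worked count (our own example): a processor performing 10 consecutive writes to one word generates 10 broadcasts under an update protocol, but only 1 invalidation (on the first write) under an invalidate protocol; writes 2-10 then hit the now-exclusive local copy with no bus traffic.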


  • False Sharing

    - two processors modify different parts of the same cache line
    - the invalidate protocol leads to ping-ponged cache lines
    - the update protocol performs reads locally, but the updates generate much traffic between processors
    - this effect is entirely an artefact of the hardware
    - need to design parallel systems/programs with this issue in mind (see the sketch below):
      - cache line size: the longer the line, the more likely false sharing becomes
      - alignment of data structures with respect to cache line size

    http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1
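    A minimal sketch (our own, not from the course notes) of false sharing: two threads increment adjacent words of the same 64-byte cache line, so the line ping-pongs between their caches under an invalidate protocol. The padded variant places each counter on its own line; timing the two versions shows the difference.

      #include <pthread.h>

      /* falsely shared: both counters live in one 64-byte line */
      static long counters[2];

      /* fix: pad each counter out to its own cache line */
      struct padded { long value; char pad[64 - sizeof(long)]; };
      static struct padded padded_counters[2];

      static void *bump(void *arg) {
          long id = (long)arg;
          for (long i = 0; i < 100000000; i++)
              counters[id]++;   /* swap in padded_counters[id].value++
                                   to eliminate the ping-ponging */
          return NULL;
      }

      int main(void) {
          pthread_t t0, t1;
          pthread_create(&t0, NULL, bump, (void *)0);
          pthread_create(&t1, NULL, bump, (void *)1);
          pthread_join(t0, NULL);
          pthread_join(t1, NULL);
          return 0;
      }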


  • Implementing Cache Coherency

    - on small-scale bus-based machines:
      - a processor must obtain access to the bus to broadcast a write invalidation
      - with two competing processors, the first to gain access to the bus will invalidate the other's data
    - a cache miss needs to locate the ‘top’ (most up-to-date) copy of the data
      - easy for a write-through cache
      - for a write-back cache, each processor's cache snoops the bus and responds if it has the top copy of the data
    - for writes, we would like to know if any other copies of the line (block) are cached
      - i.e. whether a write-back cache needs to put details on the bus
      - handled by having a bit in the line state to indicate shared status
    - minimizing processor stalls
      - by more detailed tracking of the line state, or by having (multi-level) inclusive caches


  • 3 State (MSI) Cache Coherency Protocol

    - read: local read
    - write: local write
    - c_read (coherency read): a read (miss) on a remote processor gives rise to the shown transition in the local cache
    - c_write (coherency write): a write miss, or a write in the Shared state, on a remote processor gives rise to the shown transition in the local cache (these transitions are sketched in code below)

    (Fig 2.22 Grama et al, Intro to Parallel Computing)
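    A minimal sketch (our own, following the shape of Fig 2.22; the state and event names are illustrative) of the MSI transitions for one cache line, written as a pure state function:

      typedef enum { INVALID, SHARED, MODIFIED } msi_state;
      typedef enum { READ, WRITE, C_READ, C_WRITE } msi_event;

      static msi_state msi_next(msi_state s, msi_event e) {
          switch (s) {
          case INVALID:
              if (e == READ)  return SHARED;     /* read miss: fetch line */
              if (e == WRITE) return MODIFIED;   /* write miss: invalidate
                                                    all other copies */
              return INVALID;                    /* remote traffic: no copy
                                                    held here */
          case SHARED:
              if (e == WRITE)   return MODIFIED; /* upgrade: broadcast an
                                                    invalidation */
              if (e == C_WRITE) return INVALID;  /* remote write: drop our
                                                    copy */
              return SHARED;                     /* local/remote reads keep
                                                    the line in S */
          case MODIFIED:
              if (e == C_READ)  return SHARED;   /* flush dirty data and
                                                    downgrade */
              if (e == C_WRITE) return INVALID;  /* flush, then invalidate */
              return MODIFIED;                   /* local accesses stay in M */
          }
          return s;
      }

    For example, if P0 writes a line (INVALID -> MODIFIED) and P1 then reads it, P1's miss appears at P0 as a c_read, so P0's copy moves MODIFIED -> SHARED and the dirty data is flushed.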


  • MSI Coherency Protocol

    (Fig 2.23 Grama et al, Intro to Parallel Computing)

    Discussion Point: review invalidate vs update protocols, both using the MSI protocol. Do the same for false sharing.


  • Snoopy Cache Systems

    - all caches broadcast all transactions (read or write misses, writes in the S state)
      - suited to (and easy to implement on) bus or ring interconnects
      - however, scalability is limited (i.e. ≤ 8 processors); what about torus on-chip networks? (assume wormhole routing)
    - all processors' caches monitor the bus (or interconnect port) for transactions of interest
    - each processor's cache has a set of tag bits that determine the state of the cache block
      - tags are updated according to the state diagram for the relevant protocol
      - e.g. when the snoop hardware detects that a read has been issued for a cache block of which it holds a dirty copy, it asserts control of the bus, puts the data out (to the requesting cache and to main memory), and sets the tag to the S state
    - what sort of data access characteristics are likely to perform well/badly on snoopy-based systems?


  • Snoopy Cache-Based System: Bus

    (Fig 2.24 Grama et al, Intro to Parallel Computing)


  • Snoopy Cache-Based System: Ring

    The Core i7 (Sandy Bridge) on-chip interconnect:

    - a ring-based interconnect between the Cores, Graphics, Last Level Cache (LLC) and System Agent domains
    - has 4 physical rings: Data (32 B), Request, Acknowledge and Snoop rings
    - fully pipelined; bandwidth, latency and power scale with the number of cores
    - the shortest path is chosen to minimize latency
    - has distributed arbitration & sophisticated protocols to handle coherency and ordering

    (courtesy www.lostcircuits.com)


  • Directory Cache-Based Systems

    - broadcasting is clearly not scalable
      - a solution is to send information only to the processing elements specifically interested in that data
    - this requires a directory to store the necessary information
      - augment global memory with a presence bitmap indicating which caches each memory block is located in


  • Directory-Based Cache Coherency

    (Fig 2.25 Grama et al, Intro to Parallel Computing)


  • Directory-Based Cache Coherency

    - a simple protocol might be (sketched in code below):
      - shared: one or more processors have the block cached, and the value in memory is up to date
      - uncached: no processor has a copy
      - exclusive: only one processor (the owner) has a copy, and the value in memory is out of date
    - must handle a read/write miss and a write to a shared, clean cache block
      - these first reference the directory entry to determine the current state of the block
      - then update the entry's status and presence bitmap
      - then send the appropriate state-update transactions to the processors in the presence bitmap
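    A minimal sketch (our own; all names are illustrative, and send_invalidate() stands in for a real interconnect message) of a directory entry and the handling of a write to a shared block:

      #include <stdint.h>
      #include <stdio.h>

      #define NPROCS 64

      typedef enum { UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state;

      typedef struct {
          dir_state state;
          uint64_t  presence;  /* bit i set => processor i caches the block */
      } dir_entry;

      static void send_invalidate(int proc, uint64_t block) {
          /* stand-in for a real interconnect message */
          printf("invalidate block %llu in cache %d\n",
                 (unsigned long long)block, proc);
      }

      /* processor `writer` writes a block currently in the shared state:
         invalidate every other sharer, then record exclusive ownership */
      static void handle_shared_write(dir_entry *e, uint64_t block, int writer) {
          for (int p = 0; p < NPROCS; p++)
              if (((e->presence >> p) & 1) && p != writer)
                  send_invalidate(p, block);   /* only to actual sharers */
          e->presence = 1ULL << writer;        /* sole remaining copy */
          e->state    = DIR_EXCLUSIVE;         /* memory copy now stale */
      }

      int main(void) {
          dir_entry e = { DIR_SHARED, 0x7 };   /* shared by procs 0, 1, 2 */
          handle_shared_write(&e, 42, 0);      /* proc 0 writes block 42 */
          return 0;
      }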


  • Issues in Directory-Based Systems

    - how much memory is required to store the directory? (see the worked example below)
    - what sort of data access characteristics are likely to perform well/badly on directory-based systems?
      - how do distributed and centralized systems compare?
    - should the presence bitmaps be replicated in the caches? Must they be?
    - how would you implement sending an invalidation message to all (and only to all) processors in the presence bitmap?
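    A rough worked example (our own numbers, illustrative only): with a full presence bitmap, the directory needs p bits per memory block, so for p = 256 processors and 64-byte (512-bit) blocks the overhead is 256/512 = 50% of the memory itself. The per-block cost grows linearly in p, and the total cost as O(p²), since the amount of memory also scales with the number of processors.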


  • Costs on SGI Origin 3000 (clock cycles)

                                                  ≤ 16 CPUs   > 16 CPUs
    cache hit                                          1           1
    cache miss to local memory                        85          85
    cache miss to remote home directory              125         150
    cache miss to remotely cached data (3 hops)      140         170

    Figure from http://people.nas.nasa.gov/~schang/origin_opt.html
    Data from: Computer Architecture: A Quantitative Approach, by David A. Patterson, John L. Hennessy and David Goldberg, 3rd ed., Morgan Kaufmann, 2003


  • Real Cache Coherency Protocols

    - from Wikipedia:
      - Modern systems use variants of the MSI protocol to reduce the amount of traffic in the coherency interconnect. The MESI protocol adds an “Exclusive” state to reduce the traffic caused by writes of blocks that exist in only one cache. The MOSI protocol adds an “Owned” state to reduce the traffic caused by write-backs of blocks that are read by other caches [the processor that owns the cache line services requests for that data]. The MOESI protocol does both of these things. The MESIF protocol uses the “Forward” state to reduce the traffic caused by multiple responses to read requests when the coherency architecture allows caches to respond to snoop requests with data.
    - case study: coherency via the MOESI protocol in the Sun Fire V1280 NUMA SMP (2005)


    http://users.cecs.anu.edu.au/~peter/papers/memsim-talk.pdf

  • MESI Protocol (on a bus)

    (in ‘S → (PW) → E’ and ‘I → (PW/S) → E’, replace E with M)

    Ref: https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm


  • Multi-Level Caches

    - what is the visibility of changes between levels of cache?

    http://15418.courses.cs.cmu.edu/spring2015/lecture/cachecoherence1

    - the easiest model is inclusive:
      - if a line is in the owned state in L1, it is also in the owned state in L2


  • The Coherency Wall: Cache Coherency Considered Harmful

    - interconnects are expected to consume 50× more energy than logic circuits
    - standard protocols require a broadcast message for each invalidation
      - maintaining the (MOESI) protocol also requires a broadcast on every miss
      - the energy cost of each broadcast is O(p); with O(p) processors broadcasting, the overall cost is O(p²)!
      - broadcasts also cause contention (& delay) in the network (worse than O(p²)?)
    - directory-based protocols can direct invalidation messages to only the caches holding the same data
      - far more scalable for lightly-shared data
      - worse otherwise; they also introduce overhead through indirection
      - for each cached line, we need a bit vector of length p: an O(p²) storage cost
    - false sharing in any case results in wasted traffic
    - atomic instructions (essential for locks etc.) sync the memory system down to the LLC, at a cost of O(p) energy each!
    - cache line size is sub-optimal for messages on on-chip networks


  • Cache Coherency Summary

    - cache coherency arises because the abstraction of a single shared address space is not actually implemented by a single storage unit in a machine
    - three components to cache coherency:
      - issue order, write propagation, write serialization
    - two implementations:
      - broadcast/snoop: suitable for small-medium intra-chip and small inter-socket systems
      - directory-based: suitable for medium-large inter-socket systems
    - false sharing is a potential performance issue
      - more likely, the longer the cache line
    - energy considerations argue for no coherency on large intra-chip systems, e.g. the PEZY-SC
      - instead: OS-managed distributed shared memory, or message-passing programming models


    https://en.wikichip.org/wiki/pezy/pezy-sc#Architecture