

2 • Introduction to parallel computing

Chip Multiprocessors (ACS MPhil)

Robert Mullins


Chip Multiprocessors (ACS MPhil) 2

Overview

• Parallel computing platforms
  – Approaches to building parallel computers
  – Today's chip-multiprocessor architectures

• Approaches to parallel programming
  – Programming with threads and shared memory
  – Message-passing libraries
  – PGAS languages
  – High-level parallel languages


Chip Multiprocessors (ACS MPhil) 3

Parallel computers

• How might we exploit multiple processing elements and memories in order to complete a large computation quickly?
  – How many processing elements, how powerful?
  – How do they communicate and cooperate?
    • How are memories and processing elements interconnected?
    • How is the memory hierarchy organised?
  – How might we program such a machine?


Chip Multiprocessors (ACS MPhil) 4

The control structure

• How are the processing elements controlled?
  – Centrally from a single control unit, or can they work independently?
• Flynn's taxonomy:
  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Multiple Data (MIMD)


Chip Multiprocessors (ACS MPhil) 5

The control structure

• SIMD
  – The scalar pipelines execute in lockstep
  – Data-independent logic is shared
    • Efficient for highly data-parallel applications
    • Much simpler instruction fetch and supply mechanism
  – SIMD hardware can support a SPMD model if the individual threads follow similar control flow
    • Masked execution

[Figure: A generic streaming multiprocessor (for graphics applications). Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.]


Chip Multiprocessors (ACS MPhil) 6

The communication model

• A clear distinction is made between two common communication models:
  – 1. Shared-address-space platforms
    • All processors have access to a shared data space accessed via a shared address space
    • All communication takes place via a shared memory
    • Each processing element may also have an area of memory that is private


Chip Multiprocessors (ACS MPhil) 7

The communication model

• 2. Message-passing platforms
  – Each processing element has its own exclusive address space
  – Communication is achieved by sending explicit messages between processing elements
  – The sending and receiving of messages can be used to both communicate between and synchronize the actions of multiple processing elements


Chip Multiprocessors (ACS MPhil) 8

Multi-core

Figure courtesy of Tim Harris, MSR


Chip Multiprocessors (ACS MPhil) 9

SMP multiprocessor

Figure courtesy of Tim Harris, MSR


Chip Multiprocessors (ACS MPhil) 10

NUMA multiprocessor

Figure courtesy of Tim Harris, MSR


Chip Multiprocessors (ACS MPhil) 11

Message-passing platforms

• Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands
  – e.g. a pair of processors may be interconnected with a hardware FIFO queue
  – The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)

[Figure: an interconnection network whose nodes are labelled 000–111 (Culler, Figure 1.22)]


Chip Multiprocessors (ACS MPhil) 12

Message-passing platforms

• The Transputer (1984)
  – The result of an earlier foray into the world of parallel computing!
  – The Transputer contained integrated serial links for building multiprocessors
    • IN/OUT instructions in the ISA for sending and receiving messages
  – Programmed in OCCAM (based on CSP)
• IBM Victor V256 (1991)
  – 16x16 array of transputers
  – The processors could be partitioned dynamically between different users


Chip Multiprocessors (ACS MPhil) 13

Message-passing platforms

• Recently some chip-multiprocessors have taken a similar approach (RAW/Tilera and XMOS)
  – Message queues (or communication channels) may be register mapped or accessed via special instructions
  – The processor stalls when reading an empty input queue or when trying to write to a full output buffer

[Figure: a wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped. (See also the iWarp paper on the wiki)]


Chip Multiprocessors (ACS MPhil) 14

Message-passing platforms

• For larger message-passing machines (typically scientific supercomputers), direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication assist processor)
  – The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
  – No restrictions on the programmer, or extra software support, were required
• Hardware and software evolution meant there was a general convergence of parallel machine organisations


Chip Multiprocessors (ACS MPhil) 15

Message-passing platforms

• The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations
  – Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
  – Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped

*Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.
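A minimal sketch of the two-sided style in C (illustrative, not from the slides; buffer sizes, tags and ranks are arbitrary): rank 0 sends a buffer to rank 1 with the blocking MPI_Send/MPI_Recv pair, then the exchange is repeated with the non-blocking MPI_Isend so computation can be overlapped with communication.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Blocking, two-sided: both ends name the transfer explicitly. */
    if (rank == 0)
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Non-blocking variant: post the send, do useful work, then wait. */
    if (rank == 0) {
        MPI_Request req;
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, &req);
        /* ... computation overlapped with the transfer ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}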


Chip Multiprocessors (ACS MPhil) 16

One-sided communication

• SHMEM
  – Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:
    • shmem_put (target_addr, source_addr, length, remote_pe)
    • shmem_get, shmem_barrier, etc.
  – One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement
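A hedged sketch of the same idea using the OpenSHMEM flavour of the API (routine names differ slightly from the classic shmem_put above; all other names are illustrative). PE 0 writes directly into PE 1's copy of a symmetric variable; PE 1 plays no part in the transfer.

#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Symmetric allocation: 'data' is valid at the same address on every PE. */
    long *data = shmem_malloc(sizeof(long));
    *data = me;

    shmem_barrier_all();

    if (me == 0) {
        long value = 42;
        shmem_long_put(data, &value, 1, 1);   /* one-sided write into PE 1's memory */
    }

    shmem_barrier_all();                      /* ensure the put is complete and visible */

    shmem_finalize();
    return 0;
}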


Chip Multiprocessors (ACS MPhil) 17

The communication model

• From a hardware perspective we would like to keep the machine simple (message-passing)
• But we inevitably need to simplify the programmer's and compiler's task
  – Efficiently support shared-memory programming
  – Add support for transactional memory?
  – Create a simple but high-performance target
• There are trade-offs between the complexity of the hardware and the complexity of the software and compiler.


Chip Multiprocessors (ACS MPhil) 18

Today's chip multiprocessors

• Intel Nehalem-EX (2009)
  – 8 cores
    • 2-way hyperthreaded (SMT)
    • 16 hardware threads
  – L1I 32KB, L1D 32KB
  – 256KB L2 (private)
  – 24MB L3 (shared)
    • 8 banks
    • Inclusive L3


Chip Multiprocessors (ACS MPhil) 19

Today's chip multiprocessors

[Figure: Intel Nehalem-EX (2009) — per-core L1 and L2 caches, a shared L3, and memory]


Chip Multiprocessors (ACS MPhil) 20

Today's chip multiprocessors

• IBM Power 7 (2010)
  – 8 cores (dual-chip module to hold 16 cores)
  – 32MB shared eDRAM L3 cache
  – 2-channel DDR3 controllers
  – Individual cores
    • 4-thread SMT per core
    • 6 ops/cycle
    • 4GHz


Chip Multiprocessors (ACS MPhil) 21

Today's chip multiprocessors

IBM Power 7 (2010)


Chip Multiprocessors (ACS MPhil) 22

Today's chip multiprocessors

• Sun Niagara T1 (2005)

Each core has its own level 1 cache (16KB for instructions, 8KB for data). The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines.


Chip Multiprocessors (ACS MPhil) 23

Oracle M7 Processor (2014)

• 32 cores
  – Dual-issue, OOO
  – Dynamic multithreading, 1-8 threads/core
• 256KB I&D L2 caches shared by groups of 4 cores
• 64MB L3
• Technology: 20nm, 13 metal layers
• 16 DDR channels
  – 160GB/s (vs. ~20GB/s for T1)
• >10B transistors!


Chip Multiprocessors (ACS MPhil) 24

“Manycore” designs: Tilera

• Tilera (now Mellanox)
  – Evolution of MIT RAW
  – 100 cores
  – Grid of identical tiles
  – Low-power 3-way VLIW cores
  – Cores interconnected by a selection of static and dynamic on-chip networks


Chip Multiprocessors (ACS MPhil) 25

“Manycore” designs: Celerity (2017)

• Tiered accelerator fabric
  – General-purpose tier: 5 “Rocket” RISC-V cores
  – Massively parallel tier: 496 5-stage RISC-V cores in a 16x31 tiled mesh array
  – Specialised tier: Binarized Neural Network accelerator


Chip Multiprocessors (ACS MPhil) 26

GPUs

“The NVIDIA GeForce 8800 GPU”, Hot Chips 2007

• TESLA P100
  – 56 streaming multiprocessors x 64 cores = 3584 “cores” or lanes
  – 732GB/s memory bandwidth
  – 4MB L2 cache
  – 15.3 billion transistors


Chip Multiprocessors (ACS MPhil) 27

Communication latencies

• Chip multiprocessor
  – Some have very fast core-to-core communication, as low as 1-3 cycles
  – Opportunities to add dedicated core-to-core links
  – Typical L1-to-L1 communication latencies may be around 10-100 cycles
• Other types of parallel machine:
  – Shared-memory multiprocessor: ~500 cycles
  – Cluster/supercomputer: ~5000-10000 cycles


Chip Multiprocessors (ACS MPhil) 28

Approaches to parallel programming

• “Principles of Parallel Programming”, Calvin Lin and Lawrence Snyder, Pearson, 2009

• This book provides a good overview of the different approaches to parallel programming

• There is also a significant amount of information on the course wiki
  – Try some examples!


Chip Multiprocessors (ACS MPhil) 29

Approaches to parallel programming

• Programming with threads and shared memory
• Message-passing libraries
• PGAS languages
• High-level parallel languages


Chip Multiprocessors (ACS MPhil) 30

Threads and shared memory

• A thread, or thread of execution, is a unit of parallelism
  – It consists of everything necessary to execute a sequential stream of instructions
    • program code, a call stack, a set of registers (incl. a single program counter)
  – It shares memory with other threads
• Threads cooperate and coordinate their actions by reading and writing to shared variables
  – Special atomic operations are provided by the multiprocessor for synchronization (see the sketch below)
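For example (a hedged sketch, not from the slides), C11 exposes such primitives through <stdatomic.h>; an atomic fetch-and-add lets many threads update a shared counter without losing updates:

#include <stdatomic.h>

atomic_long hits = 0;        /* shared between threads */

void record_hit(void)
{
    /* Performed indivisibly, typically via a hardware fetch-and-add
       or a compare-and-swap/LL-SC loop. */
    atomic_fetch_add(&hits, 1);
}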


Chip Multiprocessors (ACS MPhil) 31

Threads and shared memory

• How might we express threads in our code?
• fork/join
  – Fork/Join keywords can appear anywhere in code
  – General, but unstructured

p1
fork(p5)   ; start p5 in ||
p2
fork(p3)
p4
join(p5)   ; wait for p5 to complete
p6
join(p3)
p7

A forked procedure runs in parallel with the main thread


Chip Multiprocessors (ACS MPhil) 32

Threads and shared memory

• fork/join using the pthreads library
  – Limitations to bare-metal thread programming?

void *thread_func(void *ptr)
{
  int i = ((thread_args *) ptr)->input;
  ((thread_args *) ptr)->output = fib(i);
  return NULL;
}

args.input = n-1;

// create and start first thread
status = pthread_create(&thread, NULL, thread_func, (void *) &args);

// calc. fib(n-2) in parallel
result = fib(n-2);

// join
pthread_join(thread, NULL);
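The fragment above leaves the argument struct and helpers implicit; a minimal sketch of the assumed pieces (field names are taken from the casts in thread_func, everything else is illustrative):

#include <pthread.h>

typedef struct {
    int input;    /* set to n-1 before the thread is started   */
    int output;   /* filled in with fib(n-1) by the new thread */
} thread_args;

int fib(int n)    /* plain sequential Fibonacci */
{
    return (n < 2) ? n : fib(n - 1) + fib(n - 2);
}

/* The caller declares: pthread_t thread; thread_args args; int status, result, n;
   and after pthread_join combines the halves, e.g. result += args.output; */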


Chip Multiprocessors (ACS MPhil) 33

Threads and shared memory

• parbegin/parend (cobegin/coend)
• Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide.

p1
parbegin
  p5
  begin
    p2
    parbegin
      p3
      p4
    parend
  end
parend
p6
p7


Chip Multiprocessors (ACS MPhil) 34

Threads and shared memory

• Even though parbegin..parend can only represent properly nested dependency graphs, it is usually adequate
• Cilk-style spawn/sync

cilk int fib (int n)
{
  if (n < 2) return n;
  else {
    int x, y;
    x = spawn fib (n-1);
    y = spawn fib (n-2);
    sync;
    return (x+y);
  }
}

spawn – indicates that the procedure call can safely proceed in parallel
sync – wait until all previously spawned procedures have returned their results


Chip Multiprocessors (ACS MPhil) 35

Threads and shared memory

• forall (doall, parfor)
  – Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel
  – OpenMP example:

#pragma omp parallel for
for (i = first; i < n; i += prime)
    marked[i] = 1;
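A common companion idiom (not on the slide) is a parallel loop with a reduction clause, which gives each thread a private partial result and combines them when the loop ends; a hedged sketch:

#include <omp.h>

/* Count the unmarked entries in parallel; reduction(+:count) gives every
   thread its own private counter and sums them after the loop. */
long count_unmarked(const char *marked, long n)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++)
        if (!marked[i])
            count++;
    return count;
}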


Chip Multiprocessors (ACS MPhil) 36

Threads and shared memory

• Futures
  – future <expr>
    • Evaluate the expression concurrently with the calling program: an asynchronous function call
    • If a thread requires the value of a future that has not been computed, stall the thread until it is available

“The incremental garbage collection of processes”, Baker/Hewitt, 1977

y = future(fn(x));
...
z = y + 1;
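The snippet above is pseudocode; as a hedged illustration, an integer future can be emulated in C with a pthread, where future_get() plays the role of the implicit stall (all of these names are hypothetical):

#include <pthread.h>

typedef int (*int_fn)(int);

typedef struct {
    pthread_t tid;
    int_fn    fn;
    int       arg;
    int       value;
} future_int;

static void *future_worker(void *p)
{
    future_int *f = p;
    f->value = f->fn(f->arg);    /* run the deferred computation */
    return NULL;
}

void future_create(future_int *f, int_fn fn, int arg)
{
    f->fn  = fn;
    f->arg = arg;
    pthread_create(&f->tid, NULL, future_worker, f);
}

int future_get(future_int *f)
{
    pthread_join(f->tid, NULL);  /* stall until the value is available */
    return f->value;
}

Usage mirrors the slide: future_int y; future_create(&y, fn, x); ... int z = future_get(&y) + 1;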


Chip Multiprocessors (ACS MPhil) 37

Threads and shared memory

• Synchronization and coordination
  – In addition to creating threads, we also need to be able to control the way threads interact
  – Often involves identifying critical sections (see the lock sketch below)
• Mechanisms
  – Locks and barriers
  – Mutexes and monitors
  – Condition variables (wait/signal)
  – Transactional memory
• See reading group papers and examples
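For instance (a hedged sketch, names illustrative), a pthread mutex marking the critical section around a shared counter:

#include <pthread.h>

static long shared_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&count_lock);    /* enter the critical section */
        shared_count++;                     /* at most one thread in here at a time */
        pthread_mutex_unlock(&count_lock);  /* leave the critical section */
    }
    return NULL;
}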


Chip Multiprocessors (ACS MPhil) 38

Message-passing

• Simple (perhaps primitive) programming model
  – Programmer must distribute and explicitly move data
  – The fact that the interactions are explicit can be seen as both an advantage and a disadvantage
• Potentially simple hardware implementation
• Processes communicate and synchronize by sending messages
  – Message Passing Interface (MPI) standard
    • Widely used on High-Performance Computing (HPC) platforms
    • Programs tend to be portable
    • Usually written in a Single-Program Multiple-Data (SPMD) style (a sketch follows below)
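A hedged sketch of the SPMD style (not from the slides): every process runs the same program, uses its rank to pick its share of the work, and a collective combines the partial results.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sums its own strided slice of 0..N-1 (data decomposition). */
    const long N = 1000000;
    long local = 0, total = 0;
    for (long i = rank; i < N; i += size)
        local += i;

    /* Combine the partial sums on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}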


Chip Multiprocessors (ACS MPhil) 39

PGAS languages

• Partitioned Global Address Space languages
  – Aimed at large-scale distributed-memory machines
    • Aim to improve on MPI
  – PGAS languages overlay a global address space on the virtual memories of the distributed machines
    • No expectation that memories will be coherent
    • The programmer distinguishes between local and non-local data
    • The compiler generates the necessary communication calls in response to non-local references
    • The compiler exploits one-sided communication primitives rather than message-passing
• Co-Array Fortran, Unified Parallel C, Titanium (Ti) (Titanium extends Java)


Chip Multiprocessors (ACS MPhil) 40

High-level parallel languages

• Global view of computation
  – Raise the level of abstraction
    • Hide low-level details of communication and synchronization
    • Take a global view and describe the algorithm rather than per-task behavior
    • e.g. ZPL forces the programmer to think in a parallel style using array operations (reference to neighboring elements, flood, remap, reduction, ...)
    • Compiler, runtime and libraries will manage implementation details
  – Interesting examples:
    • ZPL – array programming language
    • NESL, Data Parallel Haskell (see wiki)
    • See also Cray Chapel, IBM X10, Sun Fortress languages (DARPA HPCS project)