PARALLEL PROGRAMMING MANY-CORE COMPUTING: HARDWARE (2/5)
Rob van Nieuwpoort, [email protected]


Page 1

PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

HARDWARE (2/5)

Rob van Nieuwpoort

[email protected]


Page 2

Schedule 2

1. Introduction, performance metrics & analysis

2. Many-core hardware, low-level optimizations

3. CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Page 3

Multi-core CPUs 3

Page 4

General Purpose Processors 4

Architecture

Few fat cores

Vectorization: Streaming SIMD Extensions (SSE)

Advanced Vector Extensions (AVX)

Homogeneous

Stand-alone

Memory

Shared, multi-layered

Per-core cache and shared cache

Programming

Multi-threading

OS Scheduler

Coarse-grained parallelism

Page 5

Intel 5

Page 6

AMD Magny-Cours

Two 6-core processors on a single chip

Up to four of these chips in a single compute node

48 cores in total

Non-uniform memory access (NUMA)

Per-core cache

Per-chip cache

Local memory

Remote memory (HyperTransport)

6

Page 7

AMD Magny-Cours 7

Page 8

AMD Magny-Cours 8

Page 9

AWARI on the Magny-Cours 9

DAS-2 (1999)

51 hours

72 machines / 144 cores

72 GB RAM in total

1.4 TB disk in total

Magny-Cours (2011)

45 hours

1 machine, 48 cores

128 GB RAM in 1 machine

4.5 TB disk in 1 machine

Less than 12 hours with new algorithm (needs more RAM)

Page 10

Multi-core CPU programming

Threads

Pthreads

Java threads

OpenMP (see the sketch below)

Message passing (MPI)

Vectorization

MMX, SSE, AVX, AltiVec, …

OpenCL

Supports threads and vectors

10
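As a concrete illustration of the OpenMP option above, a minimal sketch (not from the slides, assuming a C compiler with OpenMP support) that parallelizes a vector add across the cores of a multi-core CPU:

#include <omp.h>

void vectorAddOMP(int size, float* a, float* b, float* c) {
    // The loop iterations are divided over the available threads/cores.
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}

Compiled with, e.g., gcc -fopenmp, the loop runs on all cores; the same loop body can additionally be vectorized with the SSE intrinsics introduced on the following slides.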

Page 11

Vectorization on x86 architectures 11

Since | Name | Bits | Single-precision vector size | Double-precision vector size
1996 | MultiMedia eXtensions (MMX) | 64 bit | integer only | integer only
1999 | Streaming SIMD Extensions (SSE) | 128 bit | 4 floats | 2 doubles
2011 | Advanced Vector Extensions (AVX) | 256 bit | 8 floats | 4 doubles
2012 | Intel Xeon Phi accelerator (was Larrabee, MIC) | 512 bit | 16 floats | 8 doubles
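To make the table concrete: the AVX row corresponds to 256-bit __m256 vectors holding 8 floats. A minimal sketch (not from the slides, assuming the AVX intrinsics from immintrin.h) of an 8-wide vector add:

#include <immintrin.h>

void vectorAddAVX(int size, float* a, float* b, float* c) {
    // Assumes size is a multiple of 8; unaligned loads/stores are used.
    for (int i = 0; i < size; i += 8) {
        __m256 vecA = _mm256_loadu_ps(a + i);
        __m256 vecB = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(vecA, vecB));
    }
}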

Page 12

Vectorizing with SSE

Assembly instructions

16 vector registers

C or C++: intrinsics

Declare vector variables

Intrinsics are named after the instructions

Work on variables, not registers

12

Page 13

Vectorizing with SSE examples

float data[1024];

// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.

init(data);

// Set all elements in my vector to zero.

__m128 myVector0 = _mm_setzero_ps();

// Load the first 4 elements of the array into my vector.

__m128 myVector1 = _mm_load_ps(data);

// Load the second 4 elements of the array into my vector.

__m128 myVector2 = _mm_load_ps(data+4);

[Diagrams: myVector0 = {0.0, 0.0, 0.0, 0.0}, myVector1 = {0.0, 1.0, 2.0, 3.0}, myVector2 = {4.0, 5.0, 6.0, 7.0} (elements 0..3)]

13

Page 14

Vectorizing with SSE examples

// Add vectors 1 and 2; instruction performs 4 FLOPs.

__m128 myVector3 = _mm_add_ps(myVector1, myVector2);

// Multiply vectors 1 and 2; instruction performs 4 FLOPs.

__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);

// Shuffle: the low two result elements come from myVector1, the high two
// from myVector2; _MM_SHUFFLE(i3,i2,i1,i0) picks elements i0,i1 from the
// first vector and i2,i3 from the second.

__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,

_MM_SHUFFLE(1, 0, 3, 2));

[Diagrams: with myVector1 = {0.0, 1.0, 2.0, 3.0} and myVector2 = {4.0, 5.0, 6.0, 7.0} (elements 0..3):
myVector3 = myVector1 + myVector2 = {4.0, 6.0, 8.0, 10.0}
myVector4 = myVector1 * myVector2 = {0.0, 5.0, 12.0, 21.0}
myVector5 = shuffle(myVector1, myVector2) = {2.0, 3.0, 4.0, 5.0}]

14

Page 15

Vector add

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i++) {

c[i] = a[i] + b[i];

}

}

15

Page 16

Vector add with SSE: unroll loop

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i += 4) {

c[i+0] = a[i+0] + b[i+0];

c[i+1] = a[i+1] + b[i+1];

c[i+2] = a[i+2] + b[i+2];

c[i+3] = a[i+3] + b[i+3];

}

}

16

Page 17

Vector add with SSE: vectorize loop

void vectorAdd(int size, float* a, float* b, float* c) {

for(int i=0; i<size; i += 4) {

__m128 vecA = _mm_load_ps(a + i); // load 4 elts from a

__m128 vecB = _mm_load_ps(b + i); // load 4 elts from b

__m128 vecC = _mm_add_ps(vecA, vecB); // add four elts

_mm_store_ps(c + i, vecC); // store four elts

}

}

17
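_mm_load_ps and _mm_store_ps require 16-byte aligned pointers, and the loop above assumes size is a multiple of 4. A minimal sketch (not from the slides) that drops both assumptions, using the unaligned load/store variants and a scalar tail loop:

#include <xmmintrin.h>

void vectorAddAny(int size, float* a, float* b, float* c) {
    int i = 0;
    // Vectorized part: 4 elements per iteration, no alignment requirement.
    for (; i + 4 <= size; i += 4) {
        __m128 vecA = _mm_loadu_ps(a + i);
        __m128 vecB = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(vecA, vecB));
    }
    // Scalar tail for the remaining size % 4 elements.
    for (; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}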

Page 18

The Cell Broadband Engine 18

Page 19

Cell/B.E. 19

Architecture

Heterogeneous

1 PowerPC (PPE)

8 vector-processors (SPEs)

Programming

User-controlled scheduling

6 levels of parallelism, all under user control

Fine- and coarse-grain parallelism

Page 20

Cell/B.E. memory 20

“Normal” main memory

PPE: normal read / write

SPEs: Asynchronous manual transfers

Direct Memory Access (DMA)

Per-core fast memory: the Local Store (LS)

Application-managed cache

256 KB

128 x 128 bit vector registers

Page 21

Cell/B.E. 21

Page 22

Roadrunner (IBM) 22

Los Alamos National Laboratory

#1 of top500 June 2008 – November 2009

Now #19

122,400 cores, 1.4 petaflops

First petaflop system

PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz

Page 23

The Cell’s vector instructions

Differences with SSE

SPEs execute only vector instructions

More advanced shuffling

Not 16, but 128 registers!

Fused Multiply Add support

23

Page 24

FMA instruction

Multiply-Add (MAD): D = A * B + C, where the product A * B is rounded (digits truncated) before the addition.

Fused Multiply-Add (FMA): D = A * B + C computed as a single operation; all digits of the product are retained and the result is rounded only once, so no precision is lost in the intermediate step.

24
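A small sketch (not from the slides) that makes the rounding difference visible in C, using the C99 fmaf() from math.h. It assumes a typical SSE-based x86-64 build with FP contraction disabled (e.g. gcc -std=c99 -O0 -ffp-contract=off demo.c -lm), so the compiler does not fuse the separate multiply and add itself:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 1.0f + FLT_EPSILON;      // 1 + 2^-23, exactly representable
    float p = a * a;                   // rounded product: the tiny 2^-46 term is lost
    float mad = a * a - p;             // multiply, round, then subtract: 0
    float fused = fmaf(a, a, -p);      // full-precision product, rounded once: 2^-46
    printf("MAD: %g  FMA: %g\n", mad, fused);
    return 0;
}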

Page 25

Cell Programming models

IBM Cell SDK

C + MPI

OpenCL

Many models from academia...

25

Page 26

Cell SDK

Threads, but only on the PPE

Distributed memory

Local stores = application-managed cache!

DMA transfers

Signaling and mailboxes

Vectorization

26

Page 27

Direct Memory Access (DMA)

Start asynchronous DMA:
mfc_get(local store address, main memory address, #bytes, tag);

Wait for DMA to finish:
mfc_write_tag_mask(tag);
mfc_read_tag_status_all();

DMA lists

Overlap communication with useful work

Double buffering

27
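The counterpart for writing results back to main memory is mfc_put; a minimal sketch (not from the slides), using the same simplified four-argument form as the mfc_get call above (resultBuffer, PPEResultAddress and tag are placeholders):

// Start the asynchronous upload of the result to main memory.
mfc_put(resultBuffer, PPEResultAddress, sizeof(float), tag);

// Wait for the DMA with this tag before reusing the buffer.
mfc_write_tag_mask(tag);
mfc_read_tag_status_all();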

Page 28

Vector sum

float vectorSum(int size, float* vector) {

float result = 0.0;

for(int i=0; i<size; i++) {

result += vector[i];

}

return result;

}

28

Page 29

Parallelization strategy

Partition problem into 8 pieces

(Assuming a chunk fits in the Local Store)

PPE starts 8 SPE threads (a PPE-side launch is sketched below)

Each SPE processes 1 piece

Has to load data from PPE with DMA

PPE adds the 8 sub-results

29
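The PPE side is not shown on these slides. A rough sketch, assuming the libspe2 interface (spe_context_create, spe_program_load, spe_context_run) and an embedded SPE program handle named vectorsum_spu (a placeholder); spe_context_run blocks, so one POSIX thread per SPE is used:

#include <libspe2.h>
#include <pthread.h>

#define NR_SPES 8

extern spe_program_handle_t vectorsum_spu;   // the SPE program (placeholder name)

static void* runSPE(void* arg) {
    spe_context_ptr_t ctx = (spe_context_ptr_t) arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    // Blocks until the SPE program finishes.
    spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
    return NULL;
}

void launchSPEs(void) {
    pthread_t threads[NR_SPES];
    spe_context_ptr_t ctxs[NR_SPES];
    for (int i = 0; i < NR_SPES; i++) {
        ctxs[i] = spe_context_create(0, NULL);
        spe_program_load(ctxs[i], &vectorsum_spu);
        pthread_create(&threads[i], NULL, runSPE, ctxs[i]);
    }
    for (int i = 0; i < NR_SPES; i++) {
        pthread_join(threads[i], NULL);
        spe_context_destroy(ctxs[i]);
    }
    // The 8 per-SPE partial sums (returned e.g. via mailboxes or main memory)
    // are added here by the PPE.
}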

Page 30

Vector sum SPE code (1)

float vectorSum(int size, float* PPEVector) {

float result = 0.0;

int chunkSize = size / NR_SPES; // Partition the data.

float localBuffer[chunkSize]; // Allocate a buffer in

// my private local store.

int tag = 42;

// Points to my chunk in PPE memory.

float* myRemoteChunk = PPEVector + chunkSize * MY_SPE_NUMBER;

30

Page 31

Vector sum SPE code (2)

// Copy the input data from the PPE.

mfc_get(localBuffer, myRemoteChunk, chunkSize*4, tag);

mfc_write_tag_mask(tag);

mfc_read_tag_status_all();

// The real work.

for(int i=0; i<chunkSize; i++) {

result += localBuffer[i];

}

return result;

}

31

Page 32

Can we optimize this strategy? 32

Page 33

Can we optimize this strategy? 33

Vectorization (see the SPU sketch below)

Overlap communication and computation

Double buffering

Strategy:

Split in more chunks than SPEs

Let each SPE download the next chunk while processing the current chunk
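For the vectorization step, a minimal sketch (not from the slides) of the per-chunk sum using the SPU intrinsics from spu_intrinsics.h, assuming chunkSize is a multiple of 4:

#include <spu_intrinsics.h>

float chunkSum(int chunkSize, float* localBuffer) {
    vector float acc = spu_splats(0.0f);              // 4 partial sums
    vector float* vecBuffer = (vector float*) localBuffer;
    for (int i = 0; i < chunkSize / 4; i++) {
        acc = spu_add(acc, vecBuffer[i]);              // 4 additions per instruction
    }
    // Reduce the 4 partial sums to a single scalar.
    return spu_extract(acc, 0) + spu_extract(acc, 1) +
           spu_extract(acc, 2) + spu_extract(acc, 3);
}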

Page 34

DMA double buffering example (1)

float vectorSum(float* PPEVector, int size, int nrChunks) {

float result = 0.0;

int chunkSize = size / nrChunks;

int chunksPerSPE = nrChunks / NR_SPES;

int firstChunk = MY_SPE_NUMBER * chunksPerSPE;

int lastChunk = firstChunk + chunksPerSPE;

// Allocate two buffers in my private local store.

float localBuffer[2][chunkSize];

int currentBuffer = 0;

// Start asynchronous DMA of first chunk.

float* myRemoteChunk = PPEVector + firstChunk * chunkSize;

mfc_get(localBuffer[currentBuffer], myRemoteChunk, chunkSize * sizeof(float),

currentBuffer);

34

Page 35

DMA double buffering example (2)

for (int chunk = firstChunk; chunk < lastChunk; chunk++) {

// Prefetch next chunk asynchronously.

if(chunk != lastChunk - 1) {

float* nextRemoteChunk = PPEVector + (chunk+1) * chunkSize;

mfc_get(localBuffer[!currentBuffer], nextRemoteChunk,

chunkSize * sizeof(float), !currentBuffer);

}

// Wait for the current buffer's DMA to finish.

mfc_write_tag_mask(currentBuffer); mfc_read_tag_status_all();

// The real work.

for(int i=0; i<chunkSize; i++)

result += localBuffer[currentBuffer][i];

currentBuffer = !currentBuffer;

}

return result;

}

35

Page 36

Double and triple buffering

Read-only data

Double buffering

Read-write data

Triple buffering!

Work buffer

Prefetch buffer, asynchronous download

Finished buffer, asynchronous upload

General technique

On-chip networks

GPUs (PCI-e)

MPI (cluster); see the sketch below

36
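The same double-buffering idea on a cluster, as a minimal MPI sketch (not from the slides): post a non-blocking receive for the next chunk while summing the current one. It assumes rank 0 streams nrChunks chunks of chunkSize floats to this rank:

#include <mpi.h>

float receiveAndSum(int nrChunks, int chunkSize) {
    float buffer[2][chunkSize];
    MPI_Request req[2];
    float result = 0.0f;
    int cur = 0;

    // Post the receive for the first chunk.
    MPI_Irecv(buffer[cur], chunkSize, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req[cur]);

    for (int chunk = 0; chunk < nrChunks; chunk++) {
        // Prefetch the next chunk asynchronously.
        if (chunk != nrChunks - 1) {
            MPI_Irecv(buffer[!cur], chunkSize, MPI_FLOAT, 0, 0,
                      MPI_COMM_WORLD, &req[!cur]);
        }
        // Wait for the current buffer, then do the real work.
        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);
        for (int i = 0; i < chunkSize; i++) {
            result += buffer[cur][i];
        }
        cur = !cur;
    }
    return result;
}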

Page 37

Intel’s many-core platforms 37

Page 38

Intel Single-chip Cloud Computer 38

Architecture

Tile-based many-core (48 cores)

A tile is a dual-core

Stand-alone

Memory

Per-core and per-tile

Shared off-chip

Programming

Multi-processing with message passing

User-controlled mapping/scheduling

Gain performance …

Coarse-grain parallelism

Multi-application workloads (cluster-like)

Page 39

Intel Single-chip Cloud Computer 39

Page 40

Intel SCC Tile

2 cores

16 KB L1 cache per core

256 KB L2 cache per core

8 KB message passing buffer

On-chip network router

40

Page 41

Intel's Larrabee 41

GPU based on x86 architecture

Hardware multithreading

Wide SIMD

Achieved 1 tflop sustained application performance (SC09)

Canceled in Dec 2009, re-targeted to HPC market

Page 42

Intel Xeon Phi 42

Larrabee + 80-core research chip + SCC → MIC architecture

Brand name now Xeon Phi

First product: Knights Corner

GPU-like accelerator

60+ simple, Pentium-like cores

512-bit SIMD

At least 8GB of GDDR5

1 teraflop double precision

Programming is x86 compatible

OpenMP, OpenCL, Cilk, parallel libraries
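Since the cores run ordinary x86 code, the familiar shared-memory models carry over directly. A minimal sketch (not from the slides), assuming the Intel Cilk Plus cilk_for keyword mentioned above:

#include <cilk/cilk.h>

void vectorAddCilk(int size, float* a, float* b, float* c) {
    // The runtime divides the iterations over the available cores.
    cilk_for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}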