MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center
6-Oct-2014
Schedule 2
1. Introduction and Programming Basics (2-10-2014)
2. Performance analysis (6-10-2014)
3. Advanced CUDA programming (9-10-2014)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (??)
GPUs @ AMD 3
Radeon R9, top of the line: R9 295X2
For comparison:
R9 290X: performance 5.6 TFLOPS, memory 4GB, bandwidth 320 GB/s
NVIDIA GTX980 (Maxwell): performance 5.0 TFLOPS, memory 4GB, lower bandwidth: 224 GB/s
NVIDIA GTX Titan Black (Kepler): performance 5.3 TFLOPS, memory 6GB, higher bandwidth: 336 GB/s
NVIDIA GTX Titan Z vs. R9 295X2: fairly similar numbers, higher DP performance
Today 4
Revisit the VectorAdd
For GPUs
For many-core CPUs
Hardware revisited
Performance analysis
Hardware performance
Application performance
VectorAdd revisited 5
Vector add: sequential 6
void vector_add(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i++) {
c[i] = a[i] + b[i];
}
}
Vector add: GPU code (skeleton) 7
// device code: compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(int N, float* A, float* B, float* C) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < N) C[i] = A[i] + B[i];
}

// host code (should be in the same file)
int main() {
  // initialization code here ...
  int N = 5120;
  // launch N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(N, deviceA, deviceB, deviceC);
  // cleanup code here ...
}
Multi-core CPU programming 8
Two levels of parallelism:
Coarse-grain: threads / processes
Fine-grain: SIMD operations
Instantiate the threads:
Pthreads
Java threads
OpenMP
MPI
Vectorize:
Rely on compilers
Manual vectorization
Vector types
Intrinsics
OpenMP 9
Add directives to sequential code for parallel sections.

// function to add two vectors
void vector_add(int n, int* a, int* b, int* c) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

// main program
int main() {
  int in1[SIZE], in2[SIZE], res[SIZE];
  vector_add(SIZE, in1, in2, res);
}
OpenMP (for Xeon Phi, too) 10
Add directives to sequential code for parallel sections.

// Phi function to add two vectors
__attribute__((target(mic)))   // for Xeon Phi
void vector_add_Phi(int n, int* a, int* b, int* c) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

// main program
int main() {
  int in1[SIZE], in2[SIZE], res[SIZE];
  #pragma offload target(mic) in(in1,in2) inout(res)   // for Xeon Phi
  {
    vector_add_Phi(SIZE, in1, in2, res);
  }
}
Cilk (for Xeon Phi, too) 11
Add keywords to parallelize sequential code by divide-and-conquer.

cilk void VectorAdd(float *a, float *b, float *c, int n) {
  if (n < GrainSize) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  }
  else {
    spawn VectorAdd(a, b, c, n/2);
    spawn VectorAdd(a + n/2, b + n/2, c + n/2, n/2);
    sync;
  }
}
Vectorization on x86 architectures 12

Since  Name                                            Bits     SP vector size  DP vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only    integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 float         2 double
2011   Advanced Vector Extensions (AVX)                256 bit  8 float         4 double
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 float        8 double
Vectorizing with SSE 13
Assembly instructions:
Execute on vector registers
C or C++: intrinsics:
Declare vector variables
Each intrinsic names one instruction
Work on variables, not registers
Vectorizing with SSE examples
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);
// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();
// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);
// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data+4);
[Figure: vector contents after each step - myVector0 = (0.0, 0.0, 0.0, 0.0), myVector1 = (0.0, 1.0, 2.0, 3.0), myVector2 = (4.0, 5.0, 6.0, 7.0)]
14
Vectorizing with SSE examples
// Add vectors 1 and 2; instruction performs 4 FLOP.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);
// Multiply vectors 1 and 2; instruction performs 4 FLOP.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);
// _mm_shuffle_ps: the two low result elements come from vec1, the two
// high ones from vec2; _MM_SHUFFLE(w,x,y,z) gives the element indices
// (z, y from vec1; x, w from vec2).
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
                                  _MM_SHUFFLE(2, 3, 0, 1));
[Figure: with myVector1 = (0.0, 1.0, 2.0, 3.0) and myVector2 = (4.0, 5.0, 6.0, 7.0): myVector3 = (4.0, 6.0, 8.0, 10.0), myVector4 = (0.0, 5.0, 12.0, 21.0), and the shuffled myVector5]
15
Vector add with SSE: unroll loop 16
void vectorAdd(int size, float* a, float* b, float* c) {
  // assumes size is a multiple of 4
  for (int i = 0; i < size; i += 4) {
    c[i+0] = a[i+0] + b[i+0];
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
  }
}
Vector add with SSE: vectorize loop 17
void vectorAdd(int size, float* a, float* b, float* c) {
  // assumes size is a multiple of 4 and 16-byte-aligned arrays
  for (int i = 0; i < size; i += 4) {
    __m128 vecA = _mm_load_ps(a + i);     // load 4 elts from a
    __m128 vecB = _mm_load_ps(b + i);     // load 4 elts from b
    __m128 vecC = _mm_add_ps(vecA, vecB); // add 4 elts
    _mm_store_ps(c + i, vecC);            // store 4 elts
  }
}
Optional assignment 18
Implement a vectorized version of:
Element-wise array multiplication, with complex numbers
Element-wise array division, with complex numbers
Compile with gcc and measure performance with/without vectorization.
Send (pseudo-)code (and performance numbers, if you have them) by email to
Hardware revisited 19
CPUs
NVIDIA GPUs
Generic multi-core CPU 20
Hardware threads
SIMD units (vector lanes)
L1 and L2 dedicated caches
Shared L3/L4 cache
Main memory, I/O
Peak
performance
Bandwidth
Generic GPU 21
Single or SIMD execution units
Hardware scheduler
Local memory/cache
Units for executing functions with high precision
Peak
performance
Bandwidth
NVIDIA GPUs 22
Kepler: Larger SM (SMX), more registers, better scheduler, dynamic parallelism, multi-GPU
Maxwell: Modular SM (SMM), dedicated registers, dedicated schedulers, more L2 cache
Platform architecture (Fermi) 23
Memory architecture (from Fermi) 24
Configurable L1 cache per SM
16KB L1 cache / 48KB Shared
48KB L1 cache / 16KB Shared
Shared L2 cache
Device memory
[Figure: per-SM registers and L1 cache/shared memory, a shared L2 cache, and device memory, connected to host memory over the PCI-e bus]
Fermi 25
[Figure: Fermi die diagram - GPCs, each with a Raster Engine and four SMs (one Polymorph Engine per SM), shared L2 cache, six memory controllers, Host Interface, GigaThread Engine]
Consumer: GTX 480, 580
HPC: Tesla C2050
More memory, ECC
1.0 TFLOPS SP
515 GFLOPS DP
Streaming multiprocessors (SMs): GTX 580: 16, GTX 480: 15, C2050: 14
768 KB L2 cache
Fermi: SM 26
[Figure: Fermi SM block diagram, a zoom of the die diagram above]
32 cores per SM
64KB configurable
L1 cache / shared memory
32,768 32-bit registers
Fermi: CUDA Core* 27
Decoupled floating-point and integer data paths
Double-precision fused multiply-add (FMA)
Integer operations optimized for extended precision
DP throughput is 50% of SP throughput
DP: 256 FMA ops/clock
SP: 512 FMA ops/clock
*http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
Kepler: the new SMX
Consumer:
GTX680, GTX780, GTX-Titan
HPC
Tesla K10..K40
SMX features
192 CUDA cores
32 in Fermi
32 Special Function Units (SFU)
4 for Fermi
32 Load/Store units (LD/ST)
16 for Fermi
3x Perf/Watt improvement
28
A comparison 29
Maxwell: the newest SMM
Consumer:
GTX 970, GTX 980, …
HPC:
?
SMM Features:
4 subblocks of 32 cores
Dedicated L1/LM per 64 cores
Dispatch/decode/registers per 32 cores
L2 cache: 2MB (~3x vs. Kepler)
40 texture units
Lower power consumption
30
Hardware performance 31
Hardware performance metrics 32
Clock frequency [GHz] = absolute hardware speed
Memories, CPUs, interconnects
Operational speed [GFLOPS]
Instructions per cycle * frequency
Memory bandwidth [GB/s]
Differs a lot between the different memories on the chip
Power [Watt]
Derived metrics: FLOP/Byte, FLOP/Watt
Theoretical peak performance 33
Peak = chips * cores * threads/core * vector_lanes * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPS
NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPS
ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPS
DRAM memory bandwidth 34
Throughput = memory bus frequency * bits per cycle * bus width
Memory clock != CPU clock!
The result is in bits/s; divide by 8 for GB/s
Examples:
Intel Core i7 DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
ATI HD 6970 GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
Memory bandwidths 35
On-chip memory can be orders of magnitude faster
Registers, shared memory, caches, ...
E.g., AMD HD 7970 L1 cache achieves 2 TB/s
Off-chip memories: depends on the interconnect
Intel's technology: QPI (QuickPath Interconnect), 25.6 GB/s
AMD's technology: HT3 (HyperTransport 3), 19.2 GB/s
Accelerators: PCI-e 2.0, 8 GB/s
Power 36
Chip manufacturers specify the Thermal Design Power (TDP)
We can measure dissipated power
Whole system
Typically (much) lower than TDP
Power efficiency: FLOPS / Watt
Examples (with theoretical peak and TDP):
Intel Core i7: 154 / 160 = 1.0 GFLOPS/W
NVIDIA GTX 580: 1581 / 244 = 6.3 GFLOPS/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPS/W
Summary

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)
Sun Niagara 2            8            64    11.2       76
IBM BG/P                 4             8    13.6       13.6
IBM Power 7              8            32     265       68
Intel Core i7            4            16      85       25.6
AMD Barcelona            4             8      37       21.4
AMD Istanbul             6             6    62.4       25.6
AMD Magny-Cours         12            12     125       25.6
Cell/B.E.                8             8     205       25.6
NVIDIA GTX 580          16           512    1581      192
NVIDIA GTX 680           8          1536    3090      192
AMD HD 6970            384          1536    2703      176
AMD HD 7970             32          2048    3789      264
Intel Xeon Phi 7120     61           240    2417      352
Absolute hardware performance 38
Only achieved in optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
In real life, no application behaves like this
Can we reason about "real" performance?
Optional assignment 39
Compute and fill in the numbers in the table with the CPU and GPU from your machine.
Compute the FLOPs/BW as well.
Compute the numbers and fill in the table for your dream GPU.
Please send me your answers (just the added lines) by Thursday @ 11:00 at
Performance analysis 40
Amdahl's Law
Operational Intensity and the Roofline model
Software performance metrics (3 P's) 41
Performance:
Execution time
Speed-up
Computational throughput (GFLOP/s)
Computational efficiency (i.e., utilization)
Bandwidth (GB/s)
Memory efficiency (i.e., utilization)
Productivity and Portability:
Programmability
Production costs
Maintenance costs
Reason early about performance 42
Amdahl's law: speedup(p) = 1 / (s + (1 - s) / p)
s = fraction of sequential code
p = number of processors
Parallel part: assumed perfectly parallel!
How fast can it really be? Compute the achievable performance.
Amdahl's Law in pictures
RGB to gray
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
Pixel pixel = RGB[y][x];
gray[y][x] =
0.30 * pixel.R
+ 0.59 * pixel.G
+ 0.11 * pixel.B;
}
}
45
Performance evaluation 46
Measure execution time: Tpar
Absolute performance
Calculate speed-up: S = Tseq / Tpar
Relative performance
Does not take the application into account!
Execution time and speedup can be used to compare implementations of the same algorithm.
Performance measurement setup
Image sizes:
Select at least 7 different images
Order them by increasing size
Run the code 10 times per image
Assume outliers are eliminated
Ts = average of 10 sequential runs
Choose different p's:
Tp = average of 10 parallel runs
Tp_par = execution time for the parallel part
Tp_seq = execution time for the sequential part (should be the same for every p)
Report execution times & speed-ups
Full application
Parallel section only
An example: execution time
[Figure: bar chart of execution times Ts, T2, T4, T8, T16 (range 0-35) for Images 1-7]
Same example: speed-up
[Figure: speed-up (range 0-8) for Images 1-7 at p = 2, 4, 8, 16]
Strong scaling
Strong scaling: keep the total workload constant and increase the number of cores/nodes.
Weak scaling: keep the same work per compute node and increase the number of compute nodes.
How would you build a weak scaling experiment?
Derived metrics 50
Throughput: GFLOPS = #FLOPs / Tpar
Takes the application into account!
Compute utilization: Ec = GFLOPS / peak * 100
Bandwidth: BW = #bytes(RD+WR) / Tpar
Takes the application into account!
Bandwidth utilization: Ebw = BW / peak * 100
Achieved bandwidth and throughput can be used to compare *different* algorithms.
Utilization can be used to compare *different* (application, platform) combinations.
Performance analysis 51
Real-life performance vs. theoretical limits:
Understand the bottlenecks
Perform the right optimizations
... and decide when to stop fiddling with the code!
Computing realistic limits is the most difficult challenge in parallel performance analysis:
Theoretical peak limits alone => low accuracy
Use the application characteristics
Use the platform characteristics
Arithmetic/operational intensity 52
The number of operations per byte of accessed memory
Compute-intensive?
Data-intensive?
It is an application characteristic!
Ignore "overheads":
Loop counters
Array index calculations
Branches
RGB to gray 53
for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    Pixel pixel = RGB[y][x]; // 3-byte structure
    gray[y][x] =
        0.30 * pixel.R
      + 0.59 * pixel.G
      + 0.11 * pixel.B;
  }
}

2 x ADD, 3 x MUL = 5 ops per pixel
1 x RD (3 bytes) + 1 x WR (1 byte) => 4 bytes of memory accessed
OI = 5/4 = 1.25
Many-core platforms

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth  FLOPs/Byte
Sun Niagara 2            8            64    11.2       76        0.1
IBM BG/P                 4             8    13.6       13.6      1.0
IBM Power 7              8            32     265       68        3.9
Intel Core i7            4            16      85       25.6      3.3
AMD Barcelona            4             8      37       21.4      1.7
AMD Istanbul             6             6    62.4       25.6      2.4
AMD Magny-Cours         12            12     125       25.6      4.9
Cell/B.E.                8             8     205       25.6      8.0
NVIDIA GTX 580          16           512    1581      192        8.2
NVIDIA GTX 680           8          1536    3090      192       16.1
AMD HD 6970            384          1536    2703      176       15.4
AMD HD 7970             32          2048    3789      264       14.4
Intel Xeon Phi 7120     61           240    2417      352        6.9
Compute or memory intensive? RGB to Gray 55
[Figure: bar chart of platform FLOPs/Byte ratios (scale 0-17) for the platforms above, plus Intel Xeon Phi 3120, compared against the RGB-to-Gray OI]
“A multi-/many-core processor is a
device built to turn a compute-intensive
application into a memory-intensive
one”
Kathy Yelick, UC Berkeley
Applications OI 56
[Figure: applications placed on the operational intensity axis - O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods; O(log(N)): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity)
Peak performance is reached iff OI_app >= PeakFLOPs / PeakBW
Compute-intensive iff OI_app >= (FLOPs/Byte)_platform
Memory-intensive iff OI_app < (FLOPs/Byte)_platform
Attainable performance 58
[Figure: roofline plot of attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity), with the memory-intensive region left of the ridge and the compute-intensive region to the right]
Example: RGB-to-Gray, OI = 1.25
NVIDIA GTX680: P = min(3090, 1.25 * 192) = 240 GFLOPS (only 7.8% of the peak)
Intel Xeon Phi: P = min(2417, 1.25 * 352) = 440 GFLOPS (only 18.2% of the peak)
Attainable performance 59
[Figure: roofline plot with the compute-intensive and memory-intensive regions]
The Roofline model 60
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: comparing architectures 61
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9
Roofline: computational ceilings 62
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: bandwidth ceilings 63
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: optimization regions 64
Use the Roofline model to determine what to do first to gain performance:
Increase the memory streaming rate
Apply in-core optimizations
Increase the arithmetic intensity
Reader 65
Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM 52(4), 2009.