High Performance Computing and GPU Programming

Lecture 1: Introduction

Objectives · C++/CPU Review · GPU Intro · Programming Model


Page 1

High Performance Computing and GPU Programming

Lecture 1: Introduction

Objectives

C++/CPU Review

GPU Intro

Programming Model

Page 2

Objectives

Page 3

• Before we begin…a little motivation

Intel Xeon 2.67 GHz, 8 cores: 0.85 secs/iteration

Tesla C1070, 240 CUDA cores: 0.057 secs/iteration

Objectives

Page 4

Intel Xeon 2.67 GHz, 4 CPUs – 32 cores: ~145 days

Tesla C2070, 4 GPUs – 1,792 cores: ~6 days

Objectives

Page 5

Objectives

• I have three goals

– Develop good programming skills and practices

– Understand GPU programming and architecture

– Achieve peak performance through code optimization

Page 6

Objectives

• To reach these goals you need

– Understanding of pointers and computer memory

– Knowledge of some computer language • C, C++, FORTRAN – I prefer C++

– Patience • Learning GPU computing can be VERY frustrating

Page 7

C++/CPU Review

Page 8

• How does computer memory work?

C++/CPU Review

[Diagram: the memory hierarchy]

• CPU registers

• CPU cache – Level 1, Level 2

• Random access memory – physical RAM, virtual RAM

• Storage devices – external storage sources, temporary storage, permanent storage

Page 9

• Why is this important?

– Your computer takes time to access data

– Your computer takes time to operate on data

C++/CPU Review

Intel processor instruction latencies:

Instruction    Latency (clocks)
FABS           3
FADD           6
FSUB           6
FMULT          8
FDIV (S)       30
FDIV (D)       44

Page 10

• Processor stalling

– Idling occurs when a processor cannot execute the next instruction due to a dependency

– CPUs try to "hide" latency with caches

• Cache – stores a subset of data closer to the processing elements for fast access

• Latency – you are hungry and place an order; you are served 30 minutes later. The latency is 30 minutes.

Intel Nehalem processor memory latencies:

Memory         Latency (clocks)
L1             ~4
L2             ~10
L3             ~40
Main memory    ~100

C++/CPU Review

Page 11

• Latency depends on where the food is

– If it is stored in the mini-fridge by your desk – fast access (this is on-chip storage)

– If you want a burger at the hub, you have to go get it (this is off-chip storage)

• Once you have the food

– Bandwidth • How much can you scarf down in 1 second?

• Depends on how the food is organized in storage and what food you have

• Also – how are you eating your food?

C++/CPU Review

Page 12

• Very basic C++ code

• Remember – C++ starts at 0 … NOT 1!

– First element in array A is A[0], NOT A[1]

C++/CPU Review

#include <stdio.h>

int main() {
  printf("Hello all!\n");
  return 0;
}

#include inserts “header” files. They contain declarations/definitions needed by the C++ code

The main() function is always where the program starts

Blocks of code (called “lexical scopes”) are marked by “{“ for beginning and “}” for the end

Returns 0 from this function

Page 13

• Pointers

– Why do we need them? • Memory management

• GPU computing REQUIRES them

C++/CPU Review

#define n 16

int main() {
  // Define the pointer types
  double *a, **b;

  // Allocate them
  a = new double[n];
  b = new double*[n];
  for (int i = 0; i < n; i++) b[i] = new double[n];

  // Operate with them
  for (int i = 0; i < n; i++) {
    a[i] = ...
    b[0][i] = ...
    b[1][i] = ...
  }
}

Page 14

• Pointers

– What do they do?

• Point to the memory location where information can be found

• Allow dynamic allocation – on-the-fly memory management

• Can be freely passed into functions and operated on

– In the case of large data storage

• Array to function – passes ALL the data into the function – takes time

• Pointer to function – passes only the memory LOCATION to the function – much faster!

C++/CPU Review

Page 15

• Storage

– 1-D storage of all arrays

– Multi-dimensional arrays are merely more convenient

– With GPUs, multi-dimensional arrays are: • More hassle

• Slower

– Conversion is simple

C++/CPU Review

Page 16

GPU Intro

Page 17

• CPU vs. GPU

GPU Intro

[Diagram: CPU – large control unit and cache feeding a few ALUs, backed by DRAM; GPU – many small ALUs, backed by DRAM]

CPU: SIMD – single instruction, multiple data vector units

GPU: SIMT – single instruction, multiple threads

Page 18

• CPU vs. GPU

GPU Intro

[Charts: floating-point operations per second and memory bandwidth, CPU vs. GPU]

Page 19

• CPU vs. GPU

GPU Intro

               GPU – Tesla K20     CPU – Intel i7
Cores          –                   4 (8 threads)
Memory         5 GB main memory    32 KB L1 cache/core, 256 KB L2 cache/core, 8 MB shared L3
Clock speed    2.6 GHz             3.2 GHz
Bandwidth      208 GB/s            25.6 GB/s
FLOPS          1.17 × 10^12        70 × 10^9

Page 20

• Why the difference?

– The GPU is specialized for compute-intensive, highly parallel computation – graphics rendering

– More transistors are devoted to data processing rather than caching

• For CPU parallelism, rely on:

– Parallel libraries – convenient

– The compiler – very hard

• For GPU parallelism:

– A language extension requiring human interaction

– YOU must generate the parallel executable

GPU Intro

Page 21

• Another fact:

– Everything costs $$$

– AFRL supercomputer

• > $100,000

• Plus a ridiculous amount of power consumption

– GPU Tesla K20

• ~ $5,000

• Put it in your desktop – with a large power supply

– GPU GTX 660

• ~ $200

• Cheap AND fast – not a lot to work with, however

GPU Intro

Page 22

Programming Model

Page 23

Programming Model

• Compute Unified Device Architecture (CUDA)

– NVIDIA's GPU programming platform

– Not exactly high performance computing (HPC), but it shares the same aspect: combine multiple GPUs together into an HPC GPU cluster

– Data is executed over many threads in parallel

– Controlled with:

• Grid

• Blocks

• Threads

Page 24

Programming Model

[Diagram: block scheduling]

A CUDA program with four blocks (Block 0 – Block 3):

– On a GPU with 2 SMs: SM 0 runs Blocks 0 and 2, SM 1 runs Blocks 1 and 3

– On a GPU with 4 SMs: each SM runs one block

• The SM creates, manages, schedules, and executes threads in groups of parallel threads called warps

Page 25

Programming Model

• Warps are not easy

– Warp size is 32 threads on current GPUs

– Threads in a warp start together

– When an SM has a task or block:

• The threads are split into warps

• A sort of “scheduling” is done

• Most of the time we have to ignore this

– Not all problems fit into a multiple of 32!

– Many papers claim 500x speed-up for matrix operations

• Those cases are for sizes of 32x32, 256x256, 512x512, etc.

Page 26

Programming Model

[Diagram: a CUDA grid of 3×3 blocks, Block (0,0) through Block (2,2); Block (1,1) is expanded into its 3×2 threads, Thread (0,0) through Thread (2,1)]

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}

int main()
{
  ...
  // Kernel invocation with N threads
  VecAdd<<<1, N>>>(A, B, C);
  ...
}

Page 27

Programming Model

• Memory must be allocated

// Host code
int main()
{
  // Allocate input and output vectors in host memory
  float* A_h = (float*)malloc(N * sizeof(float));
  float* B_h = (float*)malloc(N * sizeof(float));
  float* C_h = (float*)malloc(N * sizeof(float));

  // Initialize input vectors
  ...

  // Allocate vectors in device memory
  float *A_d, *B_d, *C_d;
  cudaMalloc(&A_d, N * sizeof(float));
  cudaMalloc(&B_d, N * sizeof(float));
  cudaMalloc(&C_d, N * sizeof(float));

  // Copy vectors from host memory to device memory
  cudaMemcpy(A_d, A_h, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(B_d, B_h, N * sizeof(float), cudaMemcpyHostToDevice);

  // Invoke kernel
  VecAdd<<<1, N>>>(A_d, B_d, C_d, N);

  // Copy result from device memory to host memory
  cudaMemcpy(C_h, C_d, N * sizeof(float), cudaMemcpyDeviceToHost);

  // Free device memory
  cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < N)
    C[i] = A[i] + B[i];
}

Page 28

Programming Model

• Thread Hierarchy

– Controlled by “dim3” declaration

– Threads have a limit!

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
  int i = threadIdx.x;
  int j = threadIdx.y;
  C[i][j] = A[i][j] + B[i][j];
}

int main()
{
  ...
  // Kernel invocation with one block of N * N * 1 threads
  int numBlocks = 1;
  dim3 threadsPerBlock(N, N);
  MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
  ...
}

Page 29

• Now consider a multi-block multi-thread problem

• Break it up!

Programming Model

[Diagram: matrices A and B, each divided into a 2×2 grid of blocks – Block (0,0), Block (1,0), Block (0,1), Block (1,1)]

Page 30

Programming Model

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < N && j < N)
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
  ...
  // Kernel invocation
  dim3 threadsPerBlock(2, 2);
  dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
  MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
  ...
}

Block (0,0) holds Thread (0,0), Thread (1,0), Thread (0,1), Thread (1,1)

         Threads   Blocks   Grid
Idx.x    0,1       0,1      –
Idx.y    0,1       0,1      –
Dim.x    –         2        2
Dim.y    –         2        2

Page 31

Wrap Up

• Next time…

– CUDA for you

• What you need, where to get it, and how to install it

– Thread index mapping

• 2-D or 3-D to 1-D

– Introduce CUDA memory types

• Texture, local, global, shared

– Program an interpolation function (if time permits)

• CPU vs. GPU implementation