
Page 1

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013 (MatrixMult.ppt)

Matrix Multiplication Performance Improvements

These notes will introduce:
• Basic matrix multiplication on a 2-D grid/block (review)
• Limitations
• Block matrix multiplication
• Memory access patterns in matrix multiplication
• Coalescing memory accesses
  – Transpose array
  – Using shared memory tiles

Page 2

Matrix Multiplication

Matrix multiplication is an important operation in HPC and appears in many applications:

C = A * B

where A, B, and C are matrices (two-dimensional arrays).

A restricted case is when B has only one column: matrix-vector multiplication, which appears in representations of linear equations and partial differential equations.

Page 3

Matrix multiplication, C = A x B

Page 4

Assume the matrices are square (N x N).

for (i = 0; i < N; i++)
   for (j = 0; j < N; j++) {
      c[i][j] = 0;
      for (k = 0; k < N; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

Requires N^3 multiplications and N^3 additions, so sequential time complexity is O(N^3). Very easy to parallelize.

Implementing Matrix Multiplication: Sequential Code

Page 5

CUDA kernel for multiplying two arrays

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
   int k, sum = 0;
   int col = threadIdx.x + blockDim.x * blockIdx.x;   // global column index
   int row = threadIdx.y + blockDim.y * blockIdx.y;   // global row index
   if (col < N && row < N) {
      for (k = 0; k < N; k++)
         sum += gpu_a[row * N + k] * gpu_b[k * N + col];
      gpu_c[row * N + col] = sum;
   }
}

In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of array elements.
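A minimal sketch of a launch configuration that guarantees at least one thread per element (the ceiling-division idiom here is an addition, not from the slides; variable names follow the program shown later):

   int Block_Dim = 32;                              // threads per block in each dimension
   int Grid_Dim = (N + Block_Dim - 1) / Block_Dim;  // ceiling division: enough blocks to cover N
   dim3 Grid(Grid_Dim, Grid_Dim);
   dim3 Block(Block_Dim, Block_Dim);
   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);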

Page 6

Sequential version with flattened arrays, for comparison

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
   int row, col, k, sum;
   for (row = 0; row < N; row++)         // row of a
      for (col = 0; col < N; col++) {    // column of b
         sum = 0;
         for (k = 0; k < N; k++)
            sum += cpu_a[row * N + k] * cpu_b[k * N + col];
         cpu_c[row * N + col] = sum;
      }
}

Page 7

Matrix mapped onto 2-D grid and 2-D blocks

[Figure: array elements mapped onto the grid/block structure, one element per thread.
 Global column index: blockIdx.x * blockDim.x + threadIdx.x
 Global row index:    blockIdx.y * blockDim.y + threadIdx.y]

Basically the array is divided into "tiles" and one tile is mapped onto one block.

Page 8

Complete Program (several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
   …
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
   …
}

int main(int argc, char *argv[]) {
   int i, j;                              // loop counters
   int Grid_Dim_x = 1, Grid_Dim_y = 1;    // grid structure values
   int Block_Dim_x = 1, Block_Dim_y = 1;  // block structure values
   int noThreads_x, noThreads_y;          // number of threads available in device, each dimension
   int noThreads_block;                   // number of threads in a block
   int N = 10;                            // size of array in each dimension
   int *a, *b, *c, *d;
   int *dev_a, *dev_b, *dev_c;
   int size;                              // number of bytes in arrays
   cudaEvent_t start, stop;               // using CUDA events to measure time,
   float elapsed_time_ms;                 // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA ----------------------- */
   …                                      // keyboard input

   dim3 Grid(Grid_Dim_x, Grid_Dim_y);     // grid structure
   dim3 Block(Block_Dim_x, Block_Dim_y);  // block structure, threads/block limited by specific device
   size = N * N * sizeof(int);            // number of bytes in total in arrays

   a = (int*) malloc(size);               // dynamically allocated memory for arrays on host
   b = (int*) malloc(size);
   c = (int*) malloc(size);               // results from GPU
   d = (int*) malloc(size);               // results from CPU
   …                                      // load arrays with some numbers

Page 9

/* ------------- COMPUTATION DONE ON GPU ---------------------------- */

   cudaMalloc((void**)&dev_a, size);      // allocate memory on device
   cudaMalloc((void**)&dev_b, size);
   cudaMalloc((void**)&dev_c, size);

   cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

   cudaEventRecord(start, 0);             // start time here, after the memcpy

   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);
   cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop);

   printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

Where you measure time makes a big difference.

Page 10

/* ------------- COMPUTATION DONE ON HOST CPU ---------------------------- */

   cudaEventRecord(start, 0);             // use same timing

   cpu_matrixmult(a, b, d, N);            // do calculation on host

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop);

   printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);  // execution time

/* ------------------- check device creates correct results ----------------- */
   …
/* --------------------- repeat program ---------------------------------------- */
   …                                      // while loop to repeat calculation with different parameters
/* -------------- clean up --------------------------------------- */

   free(a); free(b); free(c); free(d);
   cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
   cudaEventDestroy(start);
   cudaEventDestroy(stop);
   return 0;
}

I have found problems if the CPU timing is done before the GPU timing. Anyone else?

Page 11

Some Preliminaries: Effects of First Launch

The program is written so that it can repeat the calculation with different parameters without stopping, in order to eliminate the effect of the first kernel launch.

Repeated runs might also take advantage of caching, but this seems less significant than the first-launch effect.
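One common way to separate out the first-launch cost in such measurements (a sketch, not from the slides) is an untimed warm-up launch before the timed one:

   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);  // warm-up launch; result discarded
   cudaDeviceSynchronize();                                 // wait for warm-up to complete
   cudaEventRecord(start, 0);                               // start timing only now
   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);  // timed launch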

Page 12

Some results

32 x 32 array, 1 block of 32 x 32 threads, random numbers 0-9.

Speedup = 1.65 (first run).

Check that the CPU and GPU produce the same answer.

Page 13

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 2.12 (second run).

Page 14

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 2.16 (third run). Subsequent runs vary from 2.12 to 2.18.

Page 15

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup = 151.86

Page 16

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup = 860.9

Page 17

Some results

2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.

Speedup = ?? The GPU appears to freeze. Why?

2^11 x 2^11 threads = 2^22 threads.

Memory needed per array = 2^22 integers = 2^24 bytes = 16 Mbytes.

The maximum number of threads on the GPU appears to be 2^16 x 2^16 = 2^32 threads. The server has 4 Gbytes of main memory.

Page 18

Different Array Sizes

Block size 32 x 32; number of blocks to suit array size.

Array size     Speedup*   Speedup**
32 x 32        2.18       1.26
256 x 256      151.86     110.63
1024 x 1024    860.9      733.44
2048 x 2048    ??         ??
4096 x 4096

* These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.

Page 19

GPU Limitations

The previous program has limitations:

• The number of threads must be >= the number of array elements (the code will not work if the number of threads < the number of array elements).
• The number of threads/block and blocks/grid has GPU limits, which will limit the size of arrays that can be processed.
• Keyboard input must be checked for invalid grid and block values.
• There are memory bandwidth issues.

Page 20

Compute capability 1.x

Maximum number of threads per block = 512
Maximum sizes of x- and y-dimensions of a thread block = 512
Maximum size of each dimension of a grid = 65535

A maximum of 512 threads per block means a square 2-D block cannot be greater than 16 x 16 (256 threads).

So the maximum square array dimension is 16 x 65535 (approximately 2^4 x 2^16 = 2^20), i.e., an array of about 2^20 x 2^20.

Is this a problem?

Page 21

Compute capability 2.x (coit-grid06.uncc.edu)

Maximum number of threads per block = 1024
Maximum sizes of x- and y-dimensions of a thread block = 1024
Maximum size of each dimension of a grid = 65535

A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32. Now all 1024 threads are used.

So the maximum square array dimension is 32 x 65535 (approximately 2^5 x 2^16 = 2^21), i.e., an array of about 2^21 x 2^21.

Is this a problem? These limits can also be queried at run time, as sketched below.
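A sketch of such a run-time check using the standard cudaGetDeviceProperties call (the printout format is illustrative):

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);   // query device 0
   printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
   printf("Max block dimensions: %d x %d\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1]);
   printf("Max grid dimensions: %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1]);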

Page 22

Increasing the size of arrays beyond the thread limitation of the GPU
-- using fewer threads than array elements*

Actually this one is easy and can draw upon the regular technique of sub-matrix multiplication:

[Figure: matrices divided into sub-matrices.]

* Would you want to use fewer threads than elements? Not in the textbooks (this is not their "tiling").

Page 23

To demonstrate that sub-matrix multiplication produces the correct final answer, consider simple 2 x 2 sub-matrices:
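The slide's worked example is lost in this copy; as a reconstruction (standard block-matrix algebra, not the original figure), partition A, B, and C into 2 x 2 grids of submatrices:

\[
\begin{pmatrix} C_{0,0} & C_{0,1} \\ C_{1,0} & C_{1,1} \end{pmatrix}
=
\begin{pmatrix}
A_{0,0}B_{0,0} + A_{0,1}B_{1,0} & A_{0,0}B_{0,1} + A_{0,1}B_{1,1} \\
A_{1,0}B_{0,0} + A_{1,1}B_{1,0} & A_{1,0}B_{0,1} + A_{1,1}B_{1,1}
\end{pmatrix}
\]

Expanding any submatrix product element by element reproduces exactly the terms c_ij = sum over k of a_ik * b_kj of the ordinary product, so block multiplication gives the correct final answer.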

Page 24

Pseudocode for Block Multiplication

Suppose the N x N matrices are divided into s^2 submatrices, each with N/s x N/s elements.

for (p = 0; p < s; p++)
   for (q = 0; q < s; q++) {
      Cp,q = 0;                      // clear elements of submatrix
      for (r = 0; r < s; r++)        // submatrix multiplication
         Cp,q = Cp,q + Ap,r * Br,q;  // add to accumulating submatrix
   }

Cp,q means the submatrix at row p and column q of matrix C. Cp,q = Cp,q + Ap,r * Br,q means multiply submatrices Ap,r and Br,q using matrix multiplication and add the result to submatrix Cp,q using matrix addition.

Page 25

Code for Block Multiplication

N = …;     // number of elements in rows/columns of the matrices
s = …;     // number of submatrices in each dimension
ss = N/s;  // number of elements in submatrix rows/columns (N/s an integer)

for (i = 0; i < N; i += ss)        // go through submatrices of A
   for (j = 0; j < N; j += ss) {   // and submatrices of B

      for (p = i; p < i + ss; p++)       // clear elements of submatrix of C
         for (q = j; q < j + ss; q++)
            C[p][q] = 0;

      for (r = 0; r < N; r += ss)        // step through pairs of submatrices
         for (p = i; p < i + ss; p++)    // submatrix multiplication,
            for (q = j; q < j + ss; q++) // accumulating into C
               for (k = r; k < r + ss; k++)
                  C[p][q] += A[p][k] * B[k][q];
   }

Not tested yet. Mistakes?

In a GPU version, each thread would do one value of i and j (s x s threads), as sketched below.
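A sketch of that GPU version under the stated assumptions (one thread per submatrix of C, flattened arrays, N divisible by ss; the kernel name gpu_blockmult is illustrative, not from the slides):

__global__ void gpu_blockmult(int *a, int *b, int *c, int N, int ss) {
   // each thread computes one ss x ss submatrix of C
   int i = (blockIdx.y * blockDim.y + threadIdx.y) * ss;   // top row of this C submatrix
   int j = (blockIdx.x * blockDim.x + threadIdx.x) * ss;   // leftmost column
   if (i < N && j < N) {
      for (int p = i; p < i + ss; p++)       // clear submatrix of C
         for (int q = j; q < j + ss; q++)
            c[p * N + q] = 0;
      for (int r = 0; r < N; r += ss)        // step through pairs of A, B submatrices
         for (int p = i; p < i + ss; p++)
            for (int q = j; q < j + ss; q++)
               for (int k = r; k < r + ss; k++)
                  c[p * N + q] += a[p * N + k] * b[k * N + q];
   }
}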

Page 26

Effects of memory access in matrix multiplication

One thread is responsible for computing one result Cij and needs to access a row of A and a column of B:

[Figure: one thread reading a row of A and a column of B.]

Each thread accesses one row of A and one column of B. There are N^2 row/column combinations and N^2 threads.

Page 27

Seen another way, in the first time period each thread accesses the first element in a row of A:

[Figure: Thread 0, …, Thread i, …, Thread N-1, each starting a different row of A.]

Consider the threads that access different rows. Given the row-major order in which A is stored, those accesses are not to consecutive memory locations:

Bad! Cannot do memory coalescing.

Question: how many threads access the same location?

Page 28

Next, each thread accesses the first element in a column of B:

[Figure: Thread 0, …, Thread i, …, Thread N-1, each starting a different column of B.]

Consider the threads that access different columns. Given the row-major order in which B is stored, those accesses are to consecutive memory locations:

Good! Can do memory coalescing.

Question: how many threads access the same location?

Page 29

How can we get better memory accesses and memory coalescing?

1. Transpose one array: copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.

(Not mentioned in the course textbook or other NVIDIA books, although it appears to be the obvious way; see the following results for whether it works!)

Page 30

Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
   for (j = 0; j < i; j++) {
      temp = B[i][j];
      B[i][j] = B[j][i];
      B[j][i] = temp;
   }

(In my code, I use separate arrays.)

This could be done on the host prior to copying to the device. What would the code look like on the device?
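A device version might look like this (a minimal out-of-place sketch, one thread per element; the kernel name is illustrative, not from the slides):

__global__ void gpu_transpose(int *in, int *out, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (row < N && col < N)
      out[col * N + row] = in[row * N + col];   // element (row,col) goes to (col,row)
}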

Page 31

/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ----- */

   transposeArray(a, a_T, N);             // transpose array on host

   cudaEventRecord(start, 0);             // time measured here, before the host-device
                                          // copy but not including the transpose
// cudaEventSynchronize(start);           // Needed?

   cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
   cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

   gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

   cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

   printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);

Page 32

Some results

8 x 8 array, 1 block of 8 x 8 threads.

Speedup = 1.62 over not transposing the array.

Page 33

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 1.17 over not transposing the array.

Page 34

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup = 0.89!! over not transposing the array.

Page 35

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup = 0.93!! over not transposing the array.

Page 36

2. Using shared memory with "tiling"

Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

[Figure: copies of tiles (As, Bs, Cs) made in shared memory.]

Note: this is not block matrix multiplication; the fundamental algorithm is just divided into phases.

Page 37

Developing Code

1. Declaring shared memory:

__global__ void gpu_matrixmult(int *Md, int *Nd, int *Pd, int Width) {

   __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
   __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This needs static memory allocation, so the tile size must be fixed at compile time. It is convenient to choose the same size as the kernel block, say 32 x 32.
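Since the tile width must be a compile-time constant, it would typically be set with a preprocessor define (a sketch; the value follows the slide's suggestion of matching the 32 x 32 block):

   #define TILE_WIDTH 32   // tile size fixed at compile time; matches block size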

Page 38

2. Indexing into shared memory:

For convenience, declare tile (block) and thread indices, and indices into the final C array:

int bx = blockIdx.x;    // tile (block) indices
int by = blockIdx.y;

int tx = threadIdx.x;   // thread indices
int ty = threadIdx.y;

To access (later):

Mds[ty][tx] = …         // element associated with thread
Nds[ty][tx] = …

Page 39

3. Global address:

For convenience, declare row and column:

int Row = by * TILE_WIDTH + ty;   // global indices
int Col = bx * TILE_WIDTH + tx;

Note: same as the usual global thread ID and result index in normal matrix multiplication:

int row = threadIdx.y + blockDim.y * blockIdx.y;
int col = threadIdx.x + blockDim.x * blockIdx.x;

Page 40

4. Loading shared memory:

Done using SIMT thread collaboration (very tricky):

for (int m = 0; m < N/TILE_WIDTH; m++) {             // for each tile in the row/column

   Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];    // all elements in a tile row transferred,
                                                     // one per thread (tx is the thread ID)
   Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];    // all elements in a tile column transferred,
                                                     // one per thread (ty is the thread ID)
   __syncthreads();                                  // wait for all threads in the block

   // do matrix multiplication operations on the pair of tiles …

The book says this achieves memory coalescing, although it does not appear to do so in both cases.

Page 41

Example (3 x 3 tiles)

Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

[Figure: global arrays A and B each divided into 3 x 3 tiles, indexed by by/bx (0, 1, 2), with tile number m and element index k marked.]

Global memory address into A: Row*N + (m*TILE_WIDTH + tx)
Global memory address into B: (m*TILE_WIDTH + ty)*N + Col

where m = tile number (0, 1, and 2), Row = by * TILE_WIDTH + ty, and Col = bx * TILE_WIDTH + tx.

Page 42

5. Matrix multiplication in shared memory:

int Pvalue = 0;

for (int m = 0; m < N/TILE_WIDTH; m++) {

   …                                      // copy tiles to shared memory (step 4)

   for (int k = 0; k < TILE_WIDTH; k++)   // multiply tiles, accumulating values
      Pvalue += Mds[ty][k] * Nds[k][tx];
}

Pd[Row * N + Col] = Pvalue;               // copy back to global memory

Page 43

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

   int bx = blockIdx.x; int by = blockIdx.y;    // block ID
   int tx = threadIdx.x; int ty = threadIdx.y;  // thread ID

   int Row = by * TILE_WIDTH + ty;   // identify row, column of Pd element to work on
   int Col = bx * TILE_WIDTH + tx;

   float Pvalue = 0;

   for (int m = 0; m < Width/TILE_WIDTH; m++) {           // loop over Md, Nd tiles to compute Pd element
      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];  // load Md, Nd tiles into shared memory
      Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
      __syncthreads();
      for (int k = 0; k < TILE_WIDTH; k++)
         Pvalue += Mds[ty][k] * Nds[k][tx];
      __syncthreads();               // before any thread loads the next tile
   }
   Pd[Row*Width + Col] = Pvalue;     // flattened index (see note below)
}

Code given in the book Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110.

A mistake in the book: Pd is a pointer and the size of the array is unknown, so the book's Pd[Row][Col] will not compile as given. A flattened index is needed, as used above.

Note that copying to shared memory is collaborative between threads, each thread loading one element of each array. It is in the wrong place on page 110, although that does not affect the final answer!

Page 44

Some results (second run)

32 x 32 array, 1 block of 32 x 32 threads.

Speedup, GPU to CPU: 2.12
Speedup, GPU with shared memory to CPU: 2.73
Speedup, GPU with shared memory to GPU without shared memory: 1.28

Page 45

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 153
Speedup, GPU with shared memory to CPU: 217
Speedup, GPU with shared memory to GPU without shared memory: 1.41

Page 46

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 864
Speedup, GPU with shared memory to CPU: 2214 !!!
Speedup, GPU with shared memory to GPU without shared memory: 2.56

Page 47

Some results

2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 989
Speedup, GPU with shared memory to CPU: 2962 !!
Speedup, GPU with shared memory to GPU without shared memory: 2.99

Page 48

Different Array Sizes: Speedup

Block size 32 x 32; number of blocks to suit array size.

Array size     GPU to CPU   GPU (shared memory) to CPU   GPU (shared memory) to GPU (no shared memory)
32 x 32        2.12         2.73                         1.28
256 x 256      153          217                          1.41
1024 x 1024    864          2214                         2.56
2048 x 2048    989          2962                         2.99
4096 x 4096

Page 49

Bandwidth improvements by using shared memory

Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two transfers per element instead of 2 x 32 transfers).

According to the PMPP book, page 90, using 16 x 16 tiles:

"it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate … Can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating-point performance of the G80 … effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance."
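The arithmetic behind the quoted figure, reconstructed from the quote (each float is 4 bytes, so untiled code performs roughly one floating-point operation per 4 bytes fetched, and a 16 x 16 tile reuses each fetched value 16 times):

\[
\frac{86.4\ \text{GB/s}}{4\ \text{bytes/operand}} \times 16 = 345.6\ \text{GFLOPS}
\]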

Page 50

Conclusions

Using the shared memory algorithm can make a significant difference: up to 3 times as fast as the GPU without it in the tests presented.

Speedup of almost 3000 over the CPU! (Note, though, that the CPU is an old processor.)

Page 51

This topic may be explored further in an assignment.

We need better tools. NVIDIA offers a debugging tool called Parallel Nsight for Visual Studio/Windows:

http://parallelnsight.nvidia.com/

Page 52

Further Reading

Programming Massively Parallel Processors: A Hands-on Approach
David B. Kirk and Wen-mei W. Hwu
Morgan Kaufmann, 2010

This book covers only NVIDIA GPUs and CUDA programming, despite its title.

Page 53

Questions

Page 54

Things in the PMPP book (Ch. 6) not covered yet:

• Dynamic partitioning of SM resources
• Data prefetching
• Instruction usage
• Thread granularity

Also note that page 108 says memory coalescing is not needed for shared memory!