ECE 8823A
GPU Architectures
Module 3: CUDA Execution Model -I
© David Kirk/NVIDIA and Wen-mei Hwu, 2007-2012 ECE408/CS483/ECE498al, University of Illinois, Urbana-Champaign
Objective
• A more detailed look at kernel execution
  – Data to thread assignment
• To understand the organization and scheduling of threads
  – Resource assignment at the block level
  – Scheduling at the warp level
  – Basics of SIMT execution
Reading Assignment
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 4
• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3
• Reference: CUDA Programming Guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
[Figure: the host launches Kernel 1 on Grid 1 (2×2 thread blocks) and Kernel 2 on Grid 2; Block (1,1) of Grid 2 is expanded to show its individual threads, Thread (0,0,0) through Thread (1,0,3)]
A Multi-Dimensional Grid Example
Built-In Variables
• 1D-3D grid of thread blocks
  – Built-in: gridDim
    • gridDim.x, gridDim.y, gridDim.z
  – Built-in: blockDim
    • blockDim.x, blockDim.y, blockDim.z

Example
• dim3 dimGrid(32,2,2) - 3D grid of thread blocks
• dim3 dimGrid(2,2,1) - 2D grid of thread blocks
• dim3 dimGrid(32,1,1) - 1D grid of thread blocks
• dim3 dimGrid(ceil(n/256.0),1,1) - 1D grid of thread blocks

my_kernel<<<dimGrid, dimBlock>>>(..)
Built-In Variables (2)
• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx (index of the block within the grid)
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx (index of the thread within its block)
    • threadIdx.x, threadIdx.y, threadIdx.z
  – All blocks have the same thread configuration

Example
• dim3 dimBlock(4,2,2) - 3D block of threads
• dim3 dimBlock(2,2,1) - 2D block of threads
• dim3 dimBlock(32,1,1) - 1D block of threads

my_kernel<<<dimGrid, dimBlock>>>(..)
Built-In Variables (3)
• 1D-3D grid of threads in a thread block
  – Built-in: blockIdx
    • blockIdx.x, blockIdx.y, blockIdx.z
  – Built-in: threadIdx
    • threadIdx.x, threadIdx.y, threadIdx.z
• Initialized by the runtime through a kernel call
• Ranges fixed by the compute capability and target devices
  – You can query the device (later)
2D Examples
16×16 blocks
Processing a Picture with a 2D Grid
76×62 pixels
[Figure: a 4×4 matrix M and its row-major linear layout M0 … M15; element M(2,1) maps to linear index Row*Width+Col = 2*4+1 = 9]
Row-Major Layout in C/C++
Source Code of the Picture Kernel
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m)
{
  // Calculate the row # of the d_Pin and d_Pout element to process
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column # of the d_Pin and d_Pout element to process
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  // each thread computes one element of d_Pout if in range
  if ((Row < m) && (Col < n)) {
    d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
  }
}
Figure 4.5: Covering a 76×62 picture with 16×16 blocks.
• Storage layout of data
• Assign unique ID
• Map IDs to Data (access)
Approach Summary
A Simple Running Example: Matrix Multiplication
• A simple illustration of the basic features of memory and thread management in CUDA programs
  – Thread index usage
  – Memory layout
  – Register usage
  – Assume square matrices for simplicity
  – Leave shared memory usage until later
Square Matrix-Matrix Multiplication
• P = M * N, each of size WIDTH × WIDTH
  – Each thread calculates one element of P
  – Each row of M is loaded WIDTH times from global memory
  – Each column of N is loaded WIDTH times from global memory
[Figure: input matrices M and N and output matrix P, each WIDTH × WIDTH]
Matrix Multiplication: A Simple Host Version in C
// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width) {
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
Kernel Version: Functional Description
for (int k = 0; k < Width; ++k)
  Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
• Which thread is at coordinate (row, col)?
• Threads self-allocate: each thread derives its own (row, col) from its block and thread indices
Kernel Function - A Small Example
• Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
© David Kirk/NVIDIA and Wen-mei W. Hwu, Urbana, July 9-13
[Figure: the 4×4 result matrix P partitioned into four 2×2 tiles, computed by Block(0,0), Block(0,1), Block(1,0), and Block(1,1)]
WIDTH = 4; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 2
Use 2*2 = 4 blocks
A Slightly Bigger Example
[Figure: the 8×8 result matrix P partitioned into sixteen 2×2 tiles]
WIDTH = 8; TILE_WIDTH = 2
Each block has 2*2 = 4 threads
WIDTH/TILE_WIDTH = 4
Use 4*4 = 16 blocks
A Slightly Bigger Example (cont.)
WIDTH = 8; TILE_WIDTH = 4
Each block has 4*4 = 16 threads
WIDTH/TILE_WIDTH = 2
Use 2*2 = 4 blocks
[Figure: the same 8×8 result matrix P, now partitioned into four 4×4 tiles]
// Setup the execution configuration
// TILE_WIDTH is a #define constant
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Kernel Invocation (Host-side Code)
Kernel Function
// Matrix multiplication kernel – per-thread code
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;
Col = 0 * 2 + threadIdx.x → Col = 0 or 1
Row = 0 * 2 + threadIdx.y → Row = 0 or 1
Work for Block (0,0)in a TILE_WIDTH = 2 Configuration
[Figure: Block (0,0) computes P0,0, P0,1, P1,0, and P1,1, reading rows 0-1 of M and columns 0-1 of N]
Work for Block (0,1)
Col = 1 * 2 + threadIdx.x → Col = 2 or 3
Row = 0 * 2 + threadIdx.y → Row = 0 or 1

[Figure: Block (0,1) computes P0,2, P0,3, P1,2, and P1,3, reading rows 0-1 of M and columns 2-3 of N]
A Simple Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}
QUESTIONS?