David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Threads
CUDA Thread Block

All threads in a block execute the same kernel program (SPMD)
The programmer declares the block:
  Block size: 1 to 512 concurrent threads
  Block shape: 1D, 2D, or 3D
  Block dimensions in threads
Threads have thread ID numbers within the block; the thread program uses its thread ID to select work and address shared data
Threads in the same block share data and synchronize while doing their share of the work
Threads in different blocks cannot cooperate
Each block can execute in any order relative to other blocks!
[Figure: a thread block with thread IDs 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
CUDA Thread Organization Implementation

blockIdx & threadIdx: unique coordinates that distinguish threads and identify, for each thread, the appropriate portion of the data to process:
  Assigned to threads by the CUDA runtime system
  Appear as built-in variables that are initialized by the runtime system and accessed within kernel functions
  References to the blockIdx and threadIdx variables return the values that form the coordinates of the thread
gridDim & blockDim: built-in variables which provide the dimensions of the grid and the block
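As a minimal sketch (my illustration, not from the slides), a kernel that reads all four built-in variables to locate its portion of a 1D array; the names scaleKernel, data, n, and s are hypothetical:

// Sketch: each thread combines blockIdx, blockDim, and threadIdx to find
// its element; gridDim.x * blockDim.x is the total thread count in x.
__global__ void scaleKernel(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's coordinate
    int stride = gridDim.x * blockDim.x;            // total threads in the grid
    for (; i < n; i += stride)                      // cover n even if n > #threads
        data[i] *= s;
}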
Example Thread Organization

M = 8 threads/block (1D organization): threadIdx.x = 0, 1, 2, …, 7
N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1
8*N threads/grid: blockDim.x = 8; gridDim.x = N
In the code of the figure, threadID = blockIdx.x*blockDim.x + threadIdx.x
For Thread 3 of Block 5, threadID = 5*8 + 3 = 43
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0-7; every thread runs:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
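A runnable sketch of the kernel in the figure, assuming func is some per-element device function (the squaring func below is a hypothetical stand-in):

__device__ float func(float x) { return x * x; }  // hypothetical per-element function

__global__ void applyFunc(float* input, float* output)
{
    // Thread ID computed exactly as in the figure
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

With M = 8 threads/block and N blocks, the host would launch it as applyFunc<<<N, 8>>>(d_input, d_output);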
General Thread Organization

Grid: 2D array of blocks; Block: 3D array of threads
The execution configuration determines the exact organization
In the example of the previous slide, if N = 128 and M = 32, the following execution configuration is used:

dim3 dimGrid(128,1,1);  // dimGrid.x, dimGrid.y take values 1 to 65,535
dim3 dimBlock(32,1,1);  // size of block limited to 512 threads
KernelFunction<<<dimGrid, dimBlock>>>(...);

blockIdx.x ranges between 0 and dimGrid.x-1
Another Example

dim3 dimGrid(2,2,1);
dim3 dimBlock(4,2,2);
KernelFunction<<<dimGrid, dimBlock>>>(...);

Block (1,0) has blockIdx.x = 1 and blockIdx.y = 0
Thread (2,1,0) has threadIdx.x = 2, threadIdx.y = 1, threadIdx.z = 0

[Figure: the host launches Kernel 1 on Grid 1 (a 2x2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded to show its 4x2x2 threads, Thread (0,0,0) through Thread (3,1,1).]
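For illustration (my addition, not part of the slide), a thread's linear index inside such a 3D block can be flattened from the three threadIdx components:

__global__ void linearIndexDemo(int* out)
{
    // Flatten (threadIdx.x, threadIdx.y, threadIdx.z) into one index.
    // In the 4x2x2 block above, Thread (2,1,0) gets 0*(2*4) + 1*4 + 2 = 6.
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    out[tid] = tid;
}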
Matrix Multiplication Using Multiple Blocks

Break Pd up into tiles
Each thread block calculates one tile; the block size equals the tile size
Each thread calculates one Pd element, identified using:
  blockIdx.x & blockIdx.y to identify the tile
  threadIdx.x & threadIdx.y to identify the thread within the tile

[Figure: WIDTH x WIDTH matrices Md and Nd are multiplied to produce Pd; Pd is broken into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub. bx = blockIdx.x and by = blockIdx.y select a tile; tx = threadIdx.x and ty = threadIdx.y (each ranging 0 … TILE_WIDTH-1) select the element within the tile.]
Revised Matrix Multiplication Using Multiple Blocks (cont.)

The y index of the Pd element computed by a thread is y = by*TILE_WIDTH + ty
The x index of the Pd element computed by a thread is x = bx*TILE_WIDTH + tx
Thread (tx,ty) in block (bx,by) uses row y of Md and column x of Nd to update element Pd[y*Width+x]

[Figure: same tiling diagram as the previous slide.]
A Small Example: P(4,4)

TILE_WIDTH = 2
For Block (0,0): blockIdx.x = 0 & blockIdx.y = 0
For Block (1,0): blockIdx.x = 1 & blockIdx.y = 0
For Block (0,1): blockIdx.x = 0 & blockIdx.y = 1
For Block (1,1): blockIdx.x = 1 & blockIdx.y = 1

[Figure: a 4x4 Pd partitioned into four 2x2 tiles; Block (0,0) covers P0,0-P1,1, Block (1,0) covers P2,0-P3,1, Block (0,1) covers P0,2-P1,3, Block (1,1) covers P2,2-P3,3.]
A Small Example: Multiplication

Thread (0,0) of Block (0,0) calculates Pd0,0
Thread (0,0) of Block (0,1) calculates Pd0,2
Thread (1,1) of Block (0,0) calculates Pd1,1
Thread (1,1) of Block (0,1) calculates Pd1,3

[Figure: the small example's operands and result: Md elements Md0,0 … Md3,1, Nd elements Nd0,0 … Nd1,3, and the 4x4 result Pd0,0 … Pd3,3.]
Revised Host Code for Launching the Revised Kernel

// Setup the execution configuration
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The kernel can handle arrays of dimensions up to 1,048,560 x 1,048,560 (with TILE_WIDTH = 16: 16 x 65,535 = 1,048,560), using 65,535 x 65,535 = 4,294,836,225 blocks, each with 256 threads
Total number of parallel threads = 1,099,478,073,600 > 1 Tera (10^12) threads
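A fuller host-side sketch (an elaboration under the slide's assumptions, with TILE_WIDTH = 16 and Width a multiple of TILE_WIDTH; the wrapper name MatrixMulOnDevice is illustrative):

#define TILE_WIDTH 16

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate device memory and copy the input matrices over
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // Setup the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // Copy the result back and release device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}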
Revised Matrix Multiplication Kernel Using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and Nd
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}
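The kernel above assumes Width is a multiple of TILE_WIDTH; as a hedged variant (my addition, not from the slides), a bounds guard lets surplus threads in edge blocks do nothing:

__global__ void MatrixMulKernelGuarded(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
    if (Row < Width && Col < Width) {   // skip threads past the matrix edge
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
        Pd[Row*Width+Col] = Pvalue;
    }
}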
Transparent Scalability

The CUDA runtime system can execute blocks in any order
A kernel scales across any number of execution resources
Transparent scalability: the ability to execute the same application on hardware with a different number of execution resources
Each block can execute in any order relative to other blocks.

[Figure: the same 8-block kernel grid (Block 0 … Block 7) scheduled onto two devices: with few resources the device runs two blocks at a time; with more resources it runs four blocks at a time.]
Thread Block Assignment to Streaming Multiprocessors

Upon kernel launch, threads are assigned to Streaming Multiprocessors (SMs) at block granularity
Up to 8 blocks are assigned to each SM, as resources allow
An SM in GT200 can take up to 1024 threads
  Could be 256 (threads/block) * 4 blocks
  Or 128 (threads/block) * 8 blocks, etc.
[Figure: two SMs (SM 0, SM 1), each with SP cores, an MT issue unit, and shared memory, each holding several blocks of threads t0 … tm.]
Thread Block Assignment to Streaming Multiprocessors (cont.)
GT200 has 30 SMs
Up to 240 (30 * 8) blocks execute simultaneously
Up to 30,720 (30 * 1024) concurrent threads can be resident in SMs for execution
The SM maintains thread/block IDs
The SM manages and schedules thread execution

[Figure: same two-SM diagram as the previous slide.]
GT200 Example: Thread Scheduling

Each block is divided into 32-thread warps for scheduling purposes
  The size of a warp is implementation dependent, not part of the CUDA programming model
The warp is the unit of thread scheduling in an SM
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  Each block is divided into 256/32 = 8 warps
  There are 8 * 3 = 24 warps in the SM
[Figure: a Streaming Multiprocessor (SP cores, SFUs, instruction L1 cache, instruction fetch/dispatch unit, shared memory) holding Block 1 and Block 2 warps, each warp spanning threads t0 … t31.]
GT200 Example: Thread Scheduling (cont.)

The SM implements zero-overhead warp scheduling:
  At any time, only one of the warps is executed by the SM
  Warps whose next instruction has its operands ready for consumption are eligible for execution
  Eligible warps are selected for execution using a prioritized scheduling policy
  All threads in a warp execute the same instruction when selected (SIMD execution)
[Figure: warp-scheduling timeline (TB = Thread Block, W = Warp). Warps from TB1, TB2, and TB3 are interleaved over time: whenever the running warp stalls (TB1/W1, TB2/W1, and TB3/W2 each stall in the example), the scheduler switches to another eligible warp, hiding the stall latency.]
GT200 Block Granularity Considerations

For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?

For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that would require 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM!
  SM execution resources are underutilized; fewer warps to schedule around long-latency operations
For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256)
  Full thread capacity in each SM
  Maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps)
For 32x32 blocks, we have 1024 threads per block. Not even one can fit into an SM! (512 threads/block limitation)
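These capacities are device-specific; as a small sketch (my addition), the host can query them at runtime instead of hard-coding GT200's numbers:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}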
Some Additional API Features
Application Programming Interface

The API is an extension to the C programming language
It consists of:
  Language extensions to target portions of the code for execution on the device
  A runtime library split into:
    1. A host component to control and access one or more devices from the host
    2. A device component providing device-specific functions
    3. A common component providing built-in vector types and a subset of the C runtime library in both host and device code
Language Extensions: Built-in Variables

dim3 gridDim;
  Dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;
  Dimensions of the block in threads
uint3 blockIdx;
  Block index within the grid
uint3 threadIdx;
  Thread index within the block
Common Runtime Component: Mathematical Functions

pow, sqrt, cbrt, hypot
exp, exp2, expm1
log, log2, log10, log1p
sin, cos, tan, asin, acos, atan, atan2
sinh, cosh, tanh, asinh, acosh, atanh
ceil, floor, trunc, round
etc.

When executed on the host, a given function uses the C runtime implementation if available
These functions are only supported for scalar types, not vector types
Device Runtime Component: Mathematical Functions

Some mathematical functions (e.g. sinf(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):
  __powf
  __logf, __log2f, __log10f
  __expf
  __sinf, __cosf, __tanf
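A small sketch contrasting the two paths (the kernel name sineBoth is hypothetical; sinf and __sinf are the device functions named above):

__global__ void sineBoth(float* accurate, float* fast, const float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // full-accuracy device math function
        fast[i]     = __sinf(x[i]);  // faster, lower-accuracy intrinsic
    }
}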
Host Runtime Component

Provides functions to deal with:
  Device management (including multi-device systems)
  Memory management
  Error handling
Initializes the first time a runtime function is called
A host thread can invoke device code on only one device
  Multiple host threads are required to run on multiple devices
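A minimal sketch of that one-thread-per-device model (workerForDevice is a hypothetical helper; real code would spawn one host thread per device rather than loop in main):

#include <cuda_runtime.h>

void workerForDevice(int dev)
{
    cudaSetDevice(dev);   // bind this host thread to one device
    // ... cudaMalloc, kernel launches, cudaMemcpy for this device ...
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);      // number of CUDA devices present
    for (int dev = 0; dev < count; ++dev)
        workerForDevice(dev);        // in practice: one host thread per device
    return 0;
}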
Device Runtime Component: Synchronization Function

void __syncthreads();

Synchronizes all threads in a block
Once all threads have reached this point, execution resumes normally
Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
Allowed in conditional constructs only if the conditional is uniform across the entire thread block
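A minimal sketch (my example, not from the slides) of __syncthreads() removing a RAW hazard on shared memory, assuming the array length is a multiple of the 256-thread block size so no thread diverges around the barrier; each block reverses its segment of the array:

__global__ void reverseBlock(float* d)
{
    __shared__ float tile[256];           // assumes blockDim.x == 256
    int t   = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + t;
    tile[t] = d[gid];                     // every thread writes shared memory
    __syncthreads();                      // all writes complete before any read
    d[gid] = tile[blockDim.x - 1 - t];    // read another thread's element safely
}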