David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Threads
CUDA Thread Block

All threads in a block execute the same kernel program (SPMD)
The programmer declares the block:
  Block size: 1 to 512 concurrent threads
  Block shape: 1D, 2D, or 3D
  Block dimensions in threads
Threads have thread ID numbers within the block; the thread program uses its thread ID to select work and address shared data
Threads in the same block share data and synchronize while doing their share of the work
Threads in different blocks cannot cooperate
Each block can execute in any order relative to other blocks!
[Figure: a thread block with thread IDs 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]
CUDA Thread Organization Implementation

blockIdx & threadIdx: unique coordinates that distinguish threads and identify, for each thread, the appropriate portion of the data to process:
  Assigned to threads by the CUDA runtime system
  Appear as built-in variables that are initialized by the runtime system and accessed within kernel functions
  References to the blockIdx and threadIdx variables return the values that form the coordinates of the thread
gridDim & blockDim: built-in variables which provide the dimensions of the grid and the block
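As a minimal sketch (my illustration, not from the slides), a kernel that reads all four built-in variables to locate its portion of a 1D array; the names scaleKernel, data, n, and s are hypothetical:

// Sketch: each thread combines blockIdx, blockDim, and threadIdx to find
// its element; gridDim.x * blockDim.x is the total thread count in x.
__global__ void scaleKernel(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's coordinate
    int stride = gridDim.x * blockDim.x;            // total threads in the grid
    for (; i < n; i += stride)                      // cover n even if n > #threads
        data[i] *= s;
}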
Example Thread Organization

M = 8 threads/block (1D organization): threadIdx.x = 0, 1, 2, …, 7
N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1
8*N threads/grid: blockDim.x = 8; gridDim.x = N
In the code of the figure, threadID = blockIdx.x*blockDim.x + threadIdx.x
For Thread 3 of Block 5, threadID = 5*8 + 3 = 43
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0-7; every thread runs:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
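A runnable sketch of the kernel in the figure, assuming func is some per-element device function (the squaring func below is a hypothetical stand-in):

__device__ float func(float x) { return x * x; }  // hypothetical per-element function

__global__ void applyFunc(float* input, float* output)
{
    // Thread ID computed exactly as in the figure
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

With M = 8 threads/block and N blocks, the host would launch it as applyFunc<<<N, 8>>>(d_input, d_output);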
General Thread Organization

Grid: 2D array of blocks; Block: 3D array of threads
The execution configuration determines the exact organization
In the example of the previous slide, if N = 128 and M = 32, the following execution configuration is used:

dim3 dimGrid(128,1,1);  // dimGrid.x, dimGrid.y take values 1 to 65,535
dim3 dimBlock(32,1,1);  // size of block limited to 512 threads
KernelFunction<<<dimGrid, dimBlock>>>(...);

blockIdx.x ranges between 0 and dimGrid.x-1
Another Example

dim3 dimGrid(2,2,1);
dim3 dimBlock(4,2,2);
KernelFunction<<<dimGrid, dimBlock>>>(...);

Block (1,0) has blockIdx.x = 1 and blockIdx.y = 0
Thread (2,1,0) has threadIdx.x = 2, threadIdx.y = 1, threadIdx.z = 0

[Figure: the host launches Kernel 1 on Grid 1 (a 2x2 array of blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded to show its 4x2x2 threads, Thread (0,0,0) through Thread (3,1,1).]
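For illustration (my addition, not part of the slide), a thread's linear index inside such a 3D block can be flattened from the three threadIdx components:

__global__ void linearIndexDemo(int* out)
{
    // Flatten (threadIdx.x, threadIdx.y, threadIdx.z) into one index.
    // In the 4x2x2 block above, Thread (2,1,0) gets 0*(2*4) + 1*4 + 2 = 6.
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    out[tid] = tid;
}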
Matrix Multiplication Using Multiple Blocks

Break Pd up into tiles
Each thread block calculates one tile; the block size equals the tile size
Each thread calculates one Pd element, identified using:
  blockIdx.x & blockIdx.y to identify the tile
  threadIdx.x & threadIdx.y to identify the thread within the tile

[Figure: WIDTH x WIDTH matrices Md and Nd are multiplied to produce Pd; Pd is broken into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub. bx = blockIdx.x and by = blockIdx.y select a tile; tx = threadIdx.x and ty = threadIdx.y (each ranging 0 … TILE_WIDTH-1) select the element within the tile.]
Revised Matrix Multiplication Using Multiple Blocks (cont.)

The y index of the Pd element computed by a thread is y = by*TILE_WIDTH + ty
The x index of the Pd element computed by a thread is x = bx*TILE_WIDTH + tx
Thread (tx,ty) in block (bx,by) uses row y of Md and column x of Nd to update element Pd[y*Width+x]

[Figure: same tiling diagram as the previous slide.]
A Small Example: P(4,4)

TILE_WIDTH = 2
For Block (0,0): blockIdx.x = 0 & blockIdx.y = 0
For Block (1,0): blockIdx.x = 1 & blockIdx.y = 0
For Block (0,1): blockIdx.x = 0 & blockIdx.y = 1
For Block (1,1): blockIdx.x = 1 & blockIdx.y = 1

[Figure: a 4x4 Pd partitioned into four 2x2 tiles; Block (0,0) covers P0,0-P1,1, Block (1,0) covers P2,0-P3,1, Block (0,1) covers P0,2-P1,3, Block (1,1) covers P2,2-P3,3.]
A Small Example: Multiplication

Thread (0,0) of Block (0,0) calculates Pd0,0
Thread (0,0) of Block (0,1) calculates Pd0,2
Thread (1,1) of Block (0,0) calculates Pd1,1
Thread (1,1) of Block (0,1) calculates Pd1,3

[Figure: the small example's operands and result: Md elements Md0,0 … Md3,1, Nd elements Nd0,0 … Nd1,3, and the 4x4 result Pd0,0 … Pd3,3.]
Revised Host Code for Launching the Revised Kernel

// Setup the execution configuration
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

The kernel can handle arrays of dimensions up to 1,048,560 x 1,048,560 (with TILE_WIDTH = 16: 16 x 65,535 = 1,048,560), using 65,535 x 65,535 = 4,294,836,225 blocks, each with 256 threads
Total number of parallel threads = 1,099,478,073,600 > 1 Tera (10^12) threads
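A fuller host-side sketch (an elaboration under the slide's assumptions, with TILE_WIDTH = 16 and Width a multiple of TILE_WIDTH; the wrapper name MatrixMulOnDevice is illustrative):

#define TILE_WIDTH 16

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate device memory and copy the input matrices over
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // Setup the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // Copy the result back and release device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}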
Revised Matrix Multiplication Kernel Using Multiple Blocks

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and Nd
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}
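The kernel above assumes Width is a multiple of TILE_WIDTH; as a hedged variant (my addition, not from the slides), a bounds guard lets surplus threads in edge blocks do nothing:

__global__ void MatrixMulKernelGuarded(float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
    if (Row < Width && Col < Width) {   // skip threads past the matrix edge
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
        Pd[Row*Width+Col] = Pvalue;
    }
}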
Transparent Scalability

The CUDA runtime system can execute blocks in any order
A kernel scales across any number of execution resources
Transparent scalability: the ability to execute the same application on hardware with a different number of execution resources
Each block can execute in any order relative to other blocks.

[Figure: the same 8-block kernel grid (Block 0 … Block 7) scheduled onto two devices: with few resources the device runs two blocks at a time; with more resources it runs four blocks at a time.]
Thread Block Assignment to Streaming Multiprocessors

Upon kernel launch, threads are assigned to Streaming Multiprocessors (SMs) at block granularity
Up to 8 blocks are assigned to each SM, as resources allow
An SM in GT200 can take up to 1024 threads
  Could be 256 (threads/block) * 4 blocks
  Or 128 (threads/block) * 8 blocks, etc.
[Figure: two SMs (SM 0, SM 1), each with SP cores, an MT issue unit, and shared memory, each holding several blocks of threads t0 … tm.]
Thread Block Assignment to Streaming Multiprocessors (cont.)
GT200 has 30 SMs
Up to 240 (30 * 8) blocks execute simultaneously
Up to 30,720 (30 * 1024) concurrent threads can be resident in SMs for execution
The SM maintains thread/block IDs
The SM manages and schedules thread execution

[Figure: same two-SM diagram as the previous slide.]
GT200 Example: Thread Scheduling

Each block is divided into 32-thread warps for scheduling purposes
  The size of a warp is implementation dependent, not part of the CUDA programming model
The warp is the unit of thread scheduling in an SM
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  Each block is divided into 256/32 = 8 warps
  There are 8 * 3 = 24 warps in the SM
[Figure: a Streaming Multiprocessor (SP cores, SFUs, instruction L1 cache, instruction fetch/dispatch unit, shared memory) holding Block 1 and Block 2 warps, each warp spanning threads t0 … t31.]
GT200 Example: Thread Scheduling (cont.)

The SM implements zero-overhead warp scheduling:
  At any time, only one of the warps is executed by the SM
  Warps whose next instruction has its operands ready for consumption are eligible for execution
  Eligible warps are selected for execution using a prioritized scheduling policy
  All threads in a warp execute the same instruction when selected (SIMD execution)
[Figure: warp-scheduling timeline (TB = Thread Block, W = Warp). Warps from TB1, TB2, and TB3 are interleaved over time: whenever the running warp stalls (TB1/W1, TB2/W1, and TB3/W2 each stall in the example), the scheduler switches to another eligible warp, hiding the stall latency.]
GT200 Block Granularity Considerations

For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?

For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that would require 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM!
  SM execution resources are underutilized; fewer warps to schedule around long-latency operations
For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256)
  Full thread capacity in each SM
  Maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps)
For 32x32 blocks, we have 1024 threads per block. Not even one can fit into an SM! (512 threads/block limitation)
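These capacities are device-specific; as a small sketch (my addition), the host can query them at runtime instead of hard-coding GT200's numbers:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}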
Some Additional API Features
Application Programming Interface

The API is an extension to the C programming language
It consists of:
  Language extensions to target portions of the code for execution on the device
  A runtime library split into:
    1. A host component to control and access one or more devices from the host
    2. A device component providing device-specific functions
    3. A common component providing built-in vector types and a subset of the C runtime library in both host and device code
Language Extensions: Built-in Variables

dim3 gridDim;
  Dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;
  Dimensions of the block in threads
uint3 blockIdx;
  Block index within the grid
uint3 threadIdx;
  Thread index within the block
Common Runtime Component: Mathematical Functions

pow, sqrt, cbrt, hypot
exp, exp2, expm1
log, log2, log10, log1p
sin, cos, tan, asin, acos, atan, atan2
sinh, cosh, tanh, asinh, acosh, atanh
ceil, floor, trunc, round
etc.

When executed on the host, a given function uses the C runtime implementation if available
These functions are only supported for scalar types, not vector types
Device Runtime Component: Mathematical Functions

Some mathematical functions (e.g. sinf(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):
  __powf
  __logf, __log2f, __log10f
  __expf
  __sinf, __cosf, __tanf
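A small sketch contrasting the two paths (the kernel name sineBoth is hypothetical; sinf and __sinf are the device functions named above):

__global__ void sineBoth(float* accurate, float* fast, const float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // full-accuracy device math function
        fast[i]     = __sinf(x[i]);  // faster, lower-accuracy intrinsic
    }
}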
Host Runtime Component

Provides functions to deal with:
  Device management (including multi-device systems)
  Memory management
  Error handling
Initializes the first time a runtime function is called
A host thread can invoke device code on only one device
  Multiple host threads are required to run on multiple devices
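A minimal sketch of that one-thread-per-device model (workerForDevice is a hypothetical helper; real code would spawn one host thread per device rather than loop in main):

#include <cuda_runtime.h>

void workerForDevice(int dev)
{
    cudaSetDevice(dev);   // bind this host thread to one device
    // ... cudaMalloc, kernel launches, cudaMemcpy for this device ...
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);      // number of CUDA devices present
    for (int dev = 0; dev < count; ++dev)
        workerForDevice(dev);        // in practice: one host thread per device
    return 0;
}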
Device Runtime Component: Synchronization Function

void __syncthreads();

Synchronizes all threads in a block
Once all threads have reached this point, execution resumes normally
Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
Allowed in conditional constructs only if the conditional is uniform across the entire thread block
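A minimal sketch (my example, not from the slides) of __syncthreads() removing a RAW hazard on shared memory, assuming the array length is a multiple of the 256-thread block size so no thread diverges around the barrier; each block reverses its segment of the array:

__global__ void reverseBlock(float* d)
{
    __shared__ float tile[256];           // assumes blockDim.x == 256
    int t   = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + t;
    tile[t] = d[gid];                     // every thread writes shared memory
    __syncthreads();                      // all writes complete before any read
    d[gid] = tile[blockDim.x - 1 - t];    // read another thread's element safely
}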