Kernel-based Parallel Programming
- Control Divergence
Lecture 2.2
Objective
To understand the implications of warp scheduling on control flow
Nature of control flow
Warp partitioning
Control divergence
Warps as Scheduling Units
Each Block is divided into 32-thread Warps
An implementation decision, not part of the CUDA programming model
Warps are scheduling units in SM
Threads in a warp execute in SIMD
[Figure: Block 1 Warps and Block 2 Warps, each warp consisting of threads t0 t1 t2 ... t31]
Back to Von-Neumann
Every instruction needs to be fetched from memory, decoded, then executed.
Instructions come in three flavors: operate, data transfer, and program control flow.
An example instruction cycle is the following:
Fetch | Decode | Execute | Memory
Control Flow Operations
Example of control flow instruction:
BRp #-4
if the condition is positive, jump back four instructions
How thread blocks are partitioned
Thread blocks are partitioned into warps. Thread IDs within a warp are consecutive
and increasing; Warp 0 starts with Thread ID 0.
Partitioning is always the same, so you can use this knowledge in control flow.
However, the exact size of warps may change from generation to generation (covered next).
DO NOT rely on any ordering within or between warps. If there are any dependencies
between threads, you must use __syncthreads() to get correct results (more later).
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths are serialized in current GPUs
The control paths taken by the threads in a warp are traversed one at a time until there are no more.
Control Divergence Examples
Divergence can arise only when the branch condition is a function of thread indices.
Example with divergence: if (threadIdx.x > 2) { } creates two different control paths for threads in a block.
Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp.
Example without divergence: if (blockIdx.x > 2) { } has branch granularity that is a multiple
of the block size; all threads in any given warp follow the same path.
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
Analysis for a vector size of 1,000 elements:
All threads in Blocks 0, 1, and 2 are within the valid range
(i values from 0 to 767).
Block 3 will have control divergence:
1st group: i values from 768 to 999 (within range)
2nd group: i values from 1000 to 1023 (outside range)
The performance penalty to blocks with no divergence is very small,
and only the last block has divergence.
The overall performance penalty is small as long as there is a large number of
elements.
TO LEARN MORE, READ SECTION 4.3