Hetero Lecture Slides 002: Lecture 2.2, Control Divergence


  • Kernel-based Parallel Programming

    - Control Divergence

    Lecture 2.2

  • Objective

    To understand the implications of warp scheduling on control flow

    - Nature of control flow

    - Warp partitioning

    - Control divergence

  • Warps as Scheduling Units

    Each block is divided into 32-thread warps

    An implementation decision, not part of the CUDA programming model

    Warps are the scheduling units in an SM

    Threads in a warp execute in SIMD fashion

    [Figure: Block 1 and Block 2 are each partitioned into warps of threads t0, t1, t2, ..., t31]

  • Back to Von-Neumann

    Every instruction needs to be fetched from memory, decoded, then executed.

    Instructions come in three flavors: operate, data transfer, and program control flow.

    An example instruction cycle is the following:

    Fetch | Decode | Execute | Memory

  • Control Flow Operations

    Example of a control flow instruction:

    BRp #-4

    If the condition code is positive, jump back four instructions.

  • How thread blocks are partitioned

    Thread blocks are partitioned into warps

    Thread IDs within a warp are consecutive and increasing

    Warp 0 starts with thread ID 0

    Partitioning is always the same, so you can use this knowledge in control flow

    However, the exact size of warps may change from generation to generation (covered next)

    DO NOT rely on any ordering within or between warps

    If there are any dependencies between threads, you must __syncthreads() to get correct results (more later).

  • Control Flow Instructions

    Main performance concern with branching is divergence

    Threads within a single warp take different paths

    Different execution paths are serialized in current GPUs

    The control paths taken by the threads in a warp are traversed one at a time until there are no more.

  • Control Divergence Examples

    Divergence can arise when the branch condition is a function of thread indices

    Example with divergence: if (threadIdx.x > 2) { } creates two different control paths for threads in a block

    Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp

    Example without divergence: if (blockIdx.x > 2) { } has branch granularity that is a multiple of block size; all threads in any given warp follow the same path

  • Example: Vector Addition Kernel

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__
    void vecAddKernel(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }

  • Analysis for vector size of 1,000 elements

    All threads in Blocks 0, 1, and 2 are within the valid range

    i values from 0 to 767

    Block 3 will have control divergence

    1st group: i values from 768 to 999 (pass the i < n test)

    2nd group: i values from 1000 to 1023 (fail the i < n test)

    Performance penalty to blocks with no divergence is very small

    Only the last block will have divergence

    Overall performance penalty is small as long as there is a large number of elements.

  • TO LEARN MORE, READ SECTION 4.3