CUDA Lecture 7: CUDA Threads and Atomics
Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Page 1: CUDA Lecture 7 – CUDA Threads and Atomics

Page 2: Topic 1 – Atomics

The Problem: how do you do global communication?

One option: finish a kernel and start a new one. All writes from all threads complete before a kernel finishes, so the system guarantees ordering; the drawback is that you would need to decompose kernels into before and after parts.


step1<<<grid1,blk1>>>(…);
// The system ensures that all
// writes from step 1 complete
step2<<<grid2,blk2>>>(…);

Page 3: Race Conditions

Alternatively, write to a predefined memory location. Race condition! Updates can be lost.

What is the value of a in thread 0? In thread 1917?


threadID: 0                    threadID: 1917

// vector[0] was equal to zero
vector[0] += 5;                vector[0] += 1;
…                              …
a = vector[0];                 a = vector[0];

Page 4: Race Conditions (cont.)

Returning to the example above: thread 0 could have finished execution before thread 1917 started, or the other way around, or both could be executing at the same time.

Answer: not defined by the programming model; the result can be arbitrary.

Page 5: Atomics

CUDA provides atomic operations to deal with this problem. An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.

The name atomic comes from the fact that it is uninterruptible. No dropped data, but ordering is still arbitrary.


Page 6: Atomics (cont.)

CUDA provides atomic operations to deal with this problem. They require hardware with compute capability 1.1 and above.

Different types of atomic instructions:
- Addition/subtraction: atomicAdd, atomicSub
- Minimum/maximum: atomicMin, atomicMax
- Conditional increment/decrement: atomicInc, atomicDec
- Exchange/compare-and-swap: atomicExch, atomicCAS
- More types on Fermi: atomicAnd, atomicOr, atomicXor


Page 7: Example – Histogram

// Determine frequency of colors in a picture.
// Colors have already been converted into integers.
// Each thread looks at one pixel and increments
// a counter atomically.
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}

atomicAdd returns the previous value at a certain address. This is useful for grabbing variable amounts of data from a list, as in the sketch below.
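A minimal sketch of that idea (an assumption, not from the slides; the sizes array and kernel name are hypothetical): each thread reserves a variable-sized chunk of an output buffer using the old counter value that atomicAdd returns.

// Each thread grabs n slots of the output buffer; the previous
// counter value returned by atomicAdd is the start of its chunk.
__global__ void reserve_output(int* out, int* out_count, const int* sizes)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int n = sizes[i];                    // slots this thread needs
  int base = atomicAdd(out_count, n);  // previous value = chunk start
  for (int j = 0; j < n; ++j)
    out[base + j] = i;                 // fill the reserved slots
}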

Page 8: Example – Workqueue

// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense for each
// thread to continuously grab work from a queue.
// (atomicInc operates on unsigned counters and wraps the
// counter back to 0 once it reaches queue_max.)
__global__ void workq(int* work_q, unsigned int* q_counter,
                      int* output, unsigned int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}
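A host-side sketch of launching this kernel (an assumption, not from the slides; queue_max, num_items, num_blocks and threads_per_block are assumed values): the queue counter must be zeroed before the kernel runs.

int *d_work_q, *d_output;
unsigned int *d_q_counter;
cudaMalloc(&d_work_q,    queue_max * sizeof(int));
cudaMalloc(&d_output,    num_items * sizeof(int));
cudaMalloc(&d_q_counter, sizeof(unsigned int));
cudaMemset(d_q_counter, 0, sizeof(unsigned int)); // start at slot 0
workq<<<num_blocks, threads_per_block>>>(d_work_q, d_q_counter,
                                         d_output, queue_max);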

Page 9: Compare and Swap

int atomicCAS(int* address, int compare, int val)

If compare equals the old value stored at address, then val is stored instead. In either case, the routine returns the value of old. It seems a bizarre routine at first sight, but it can be very useful for atomic locks.

Page 10: Compare and Swap (cont.)

Most general type of atomic: you can emulate all of the others with CAS (a sketch follows below).

// What atomicCAS does, expressed as ordinary code;
// the hardware performs all of this atomically.
int atomicCAS(int* address, int compare, int val)
{
  int old_reg_val = *address;
  if (old_reg_val == compare)
    *address = val;
  return old_reg_val;
}
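A minimal sketch (an assumption, not from the slides) of the emulation claim: atomicAdd rebuilt from an atomicCAS retry loop. The same pattern works for min, max, and the other read-modify-write atomics.

__device__ int my_atomicAdd(int* address, int inc)
{
  int old = *address, assumed;
  do {
    assumed = old;
    // Install assumed + inc only if nobody changed *address
    // since we read it; otherwise retry with the new value.
    old = atomicCAS(address, assumed, assumed + inc);
  } while (old != assumed);
  return old;  // same convention: previous value at address
}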

Page 11: Atomics

Atomics are slower than normal load/store. Most of these are associative operations on signed/unsigned integers: quite fast for data in shared memory, slower for data in device memory.

You can have the whole machine queuing on a single location in memory.

Atomics are unavailable on G80!


Page 12: Example – Global Min/Max (Naïve)

// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow.
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}

Page 13: Example – Global Min/Max (Better)

// Introduce intermediate maximum results, so that
// most threads do not try to update the global max.
__global__ void global_max(int* values, int* max,
                           int* regional_maxes, int num_regions)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
    atomicMax(max, val);
}

Page 14: Global Max/Min

A single value causes a serial bottleneck. Create a hierarchy of values for more parallelism; one common arrangement is sketched below. Performance will still be slow, so use atomics judiciously.
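A minimal sketch of such a hierarchy (an assumption, not the slides' code): each block reduces into a shared-memory maximum, and only one thread per block touches the single global location. Requires hardware with shared-memory atomics.

#include <limits.h>

__global__ void hierarchical_max(int* values, int* gl_max)
{
  __shared__ int block_max;
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (threadIdx.x == 0) block_max = INT_MIN;
  __syncthreads();
  atomicMax(&block_max, values[i]);  // fast: shared memory
  __syncthreads();
  if (threadIdx.x == 0)
    atomicMax(gl_max, block_max);    // one global atomic per block
}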

Page 15: Atomics – Summary

Can’t use normal load/store for inter-thread communication because of race conditions

Use atomic instructions for sparse and/or unpredictable global communication

Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism


Page 16: Topic 2 – Streaming Multiprocessor Execution and Divergence

How a streaming multiprocessor (SM) executes threads: an overview of how an SM works, SIMT execution, and divergence.


Page 17: Scheduling Blocks onto SMs

The hardware schedules thread blocks onto available SMs. There is no guarantee of ordering among thread blocks; the hardware will schedule a thread block as soon as a previous thread block finishes.

[Figure: thread blocks 5, 27, 61, and 2001 being scheduled onto a streaming multiprocessor.]

Page 18: Recall – Warps

A warp = 32 threads launched together; they usually execute together as well.


[Figure: rows of control units paired with ALUs; one control unit driving many ALUs is what makes the threads of a warp execute in lockstep.]

Page 19: Mapping of Thread Blocks

Each thread block is mapped to one or more warps

The hardware schedules each warp independently


[Figure: Thread Block N (128 threads) mapped to four warps, TB N W1 through TB N W4.]

Page 20: Thread Scheduling Example

The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution, and eligible warps are selected on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected.


[Figure: scheduling timeline (TB = Thread Block, W = Warp). TB1 W1 issues instructions until it stalls, the scheduler switches to TB2 W1, then TB3 W1, and each warp resumes once its operands are ready.]

Page 21: Control Flow Divergence

Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time

What happens if you have the following code?


if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}

Page 22: Control Flow Divergence (cont.)

This is called warp divergence – CUDA will generate correct code to handle it, but to understand the performance implications you need to understand what CUDA does with it.


Page 23: Control Flow Divergence (cont.)


[Figure from Fung et al., MICRO ’07: at a branch, the warp executes Path A and then Path B serially, masking off the threads on the untaken path.]

Page 24: Control Flow Divergence (cont.)

Nested branches are handled as well


if (foo(threadIdx.x))
{
  if (bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();

Page 25: Control Flow Divergence (cont.)


[Figure: nested divergence – the warp serializes over Path A, Path B, and Path C as it works through the nested branches.]

Page 26: Control Flow Divergence (cont.)

You don’t have to worry about divergence for correctness. (Mostly true, except for corner cases such as intra-warp locks.) You might have to think about it for performance, though; that depends on your branch conditions.


Page 27: Control Flow Divergence (cont.)

One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true. In the previous example, all threads compute the logical predicate and then two predicated instructions:


p = (foo(threadIdx.x));
 p: do_A();
!p: do_B();

Page 28: Control Flow Divergence (cont.)

Performance drops off with the degree of divergence


[Figure: measured performance versus degree of divergence for the switch statement below; performance falls steadily as the number of divergent cases grows.]

switch (threadIdx.x % N)
{
  case 0: ...
  case 1: ...
}

Page 29: Control Flow Divergence (cont.)

Performance drops off with the degree of divergence. In the worst case you effectively lose a factor of 32 in performance, if one thread needs an expensive branch while the rest do nothing.

Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing. GPU implementation: first process the list to build two sub-lists of “simple” and “expensive” elements, then process the two sub-lists separately. A sketch of the first pass follows.
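A minimal sketch of that first pass (an assumption, not from the slides; is_expensive is a hypothetical classifier): atomic counters build the two sub-lists, which later kernels can then process without divergence.

__global__ void split(const int* items, int n,
                      int* simple, int* n_simple,
                      int* expensive, int* n_expensive)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i >= n) return;
  int item = items[i];
  if (is_expensive(item))                        // hypothetical test
    expensive[atomicAdd(n_expensive, 1)] = item;
  else
    simple[atomicAdd(n_simple, 1)] = item;
}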

Page 30: Synchronization

We have already introduced __syncthreads(), which forms a barrier – all threads wait until every one has reached this point.

When writing conditional code, you must be careful to make sure that all threads do reach the __syncthreads(); otherwise you can end up in deadlock, as the sketch below illustrates.

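A minimal sketch of the pitfall (an assumption, not the slides' code), reusing the do_A/do_B pattern from the divergence examples:

// WRONG: only some threads reach the barrier, so the block
// can deadlock waiting for threads that never arrive.
if (threadIdx.x < 16)
{
  do_A();
  __syncthreads();
}
else
{
  do_B();
}

// RIGHT: every thread in the block reaches the same barrier.
if (threadIdx.x < 16) do_A();
else                  do_B();
__syncthreads();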

Page 31: Synchronization (cont.)

Fermi supports some new synchronization instructions which are similar to __syncthreads() but have extra capabilities (usage sketched below):

int __syncthreads_count(predicate)
  counts how many predicates are true
int __syncthreads_and(predicate)
  returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
  returns non-zero (true) if any predicate is true

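A minimal usage sketch (an assumption, not from the slides): a block counts how many of its threads found a target value with one combined barrier-plus-reduction, then one thread folds the block total into a global counter. Requires Fermi (compute capability 2.0) or later.

__global__ void count_matches(const int* data, int target, int* result)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int found = (data[i] == target);               // per-thread predicate
  int block_total = __syncthreads_count(found);  // barrier + count
  if (threadIdx.x == 0)
    atomicAdd(result, block_total);              // one atomic per block
}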

Page 32: Warp voting

There are similar warp voting instructions which operate at the level of a warp (usage sketched below):

int __all(predicate)
  returns non-zero (true) if all predicates in the warp are true
int __any(predicate)
  returns non-zero (true) if any predicate in the warp is true
unsigned int __ballot(predicate)
  sets the nth bit based on the nth predicate

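A minimal sketch (an assumption, not from the slides): __ballot collects one bit per lane, and the first voting lane issues a single atomic on behalf of the whole warp. __popc and __ffs are the standard CUDA population-count and find-first-set intrinsics.

__global__ void warp_count(const int* data, int threshold, int* total)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int votes = __ballot(data[i] > threshold); // one bit per lane
  int lane = threadIdx.x & 31;                        // lane within warp
  if (votes != 0 && lane == __ffs(votes) - 1)
    atomicAdd(total, __popc(votes)); // first voter reports warp count
}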

Page 33: Topic 3 – Locks

Use locks very judiciously. Always include a max_iter in your spinloop! Decompose your data and your locks.


Page 34: Example – Global atomic lock

Problem: when a thread writes data to device memory, the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked.

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  lock = 0;
}

Page 35: Example – Global atomic lock (cont.)

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  __threadfence(); // wait for writes to finish
  lock = 0;
}

Page 36: Example – Global atomic lock (cont.)

__threadfence_block()
  waits until all global and shared memory writes are visible to all threads in the block

__threadfence()
  waits until all global and shared memory writes are visible to all threads in the block and, for global data, to all threads on the device

One way to make such a lock safer in practice is sketched below.

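A minimal sketch (an assumption, not the slides' code) of one way to sidestep the intra-warp lock corner case mentioned earlier: only thread 0 of each block contends for the global lock, so the threads of a warp never spin against each other.

__device__ int lock = 0; // 0 unlocked, 1 locked

__global__ void leader_lock_kernel(int* shared_resource)
{
  if (threadIdx.x == 0)
  {
    do {} while (atomicCAS(&lock, 0, 1)); // only the leader spins
    *shared_resource += 1;                // critical section
    __threadfence();                      // publish writes first
    atomicExch(&lock, 0);                 // free the lock
  }
  __syncthreads();                        // block waits for its leader
}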

Page 37: Summary

CUDA has lots of esoteric capabilities – don’t worry about most of them. It is essential to understand warp divergence, which can have a very big impact on performance. __syncthreads() is vital; the rest can be ignored until you have a critical need – then read the documentation carefully and look for examples in the SDK.


Page 38: End Credits

Based on original material from Oxford University (Mike Giles) and Stanford University (Jared Hoberock, David Tarjan).

Revision history: last updated 8/8/2011.
