CUDA Lecture 7: CUDA Threads and Atomics
Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Page 1: CUDA Lecture 7 – CUDA Threads and Atomics

Page 2: Topic 1 – Atomics

The Problem: how do you do global communication?

One option: finish a kernel and start a new one. All writes from all threads complete before a kernel finishes, so the system guarantees ordering; the drawback is that you would need to decompose kernels into before and after parts.


step1<<<grid1,blk1>>>(…);
// The system ensures that all
// writes from step 1 complete
step2<<<grid2,blk2>>>(…);

Page 3: Race Conditions

Alternatively, write to a predefined memory location. Race condition! Updates can be lost.

What is the value of a in thread 0? In thread 1917?


threadID: 0                    threadID: 1917

// vector[0] was equal to zero
vector[0] += 5;                vector[0] += 1;
…                              …
a = vector[0];                 a = vector[0];

Page 4: Race Conditions (cont.)

Returning to the example above: thread 0 could have finished execution before thread 1917 started, or the other way around, or both could be executing at the same time.

Answer: not defined by the programming model; the result can be arbitrary.

Page 5: Atomics

CUDA provides atomic operations to deal with this problem. An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.

The name atomic comes from the fact that it is uninterruptible. No dropped data, but ordering is still arbitrary.


Page 6: Atomics (cont.)

CUDA provides atomic operations to deal with this problem. They require hardware with compute capability 1.1 and above.

Different types of atomic instructions:
- Addition/subtraction: atomicAdd, atomicSub
- Minimum/maximum: atomicMin, atomicMax
- Conditional increment/decrement: atomicInc, atomicDec
- Exchange/compare-and-swap: atomicExch, atomicCAS
- More types on Fermi: atomicAnd, atomicOr, atomicXor


Page 7: Example – Histogram

// Determine frequency of colors in a picture.
// Colors have already been converted into integers.
// Each thread looks at one pixel and increments
// a counter atomically.
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}

atomicAdd returns the previous value at a certain address. This is useful for grabbing variable amounts of data from a list, as in the sketch below.
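A minimal sketch of that idea (an assumption, not from the slides; the sizes array and kernel name are hypothetical): each thread reserves a variable-sized chunk of an output buffer using the old counter value that atomicAdd returns.

// Each thread grabs n slots of the output buffer; the previous
// counter value returned by atomicAdd is the start of its chunk.
__global__ void reserve_output(int* out, int* out_count, const int* sizes)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int n = sizes[i];                    // slots this thread needs
  int base = atomicAdd(out_count, n);  // previous value = chunk start
  for (int j = 0; j < n; ++j)
    out[base + j] = i;                 // fill the reserved slots
}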

Page 8: Example – Workqueue

// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense for each
// thread to continuously grab work from a queue.
// (atomicInc operates on unsigned counters and wraps the
// counter back to 0 once it reaches queue_max.)
__global__ void workq(int* work_q, unsigned int* q_counter,
                      int* output, unsigned int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}
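A host-side sketch of launching this kernel (an assumption, not from the slides; queue_max, num_items, num_blocks and threads_per_block are assumed values): the queue counter must be zeroed before the kernel runs.

int *d_work_q, *d_output;
unsigned int *d_q_counter;
cudaMalloc(&d_work_q,    queue_max * sizeof(int));
cudaMalloc(&d_output,    num_items * sizeof(int));
cudaMalloc(&d_q_counter, sizeof(unsigned int));
cudaMemset(d_q_counter, 0, sizeof(unsigned int)); // start at slot 0
workq<<<num_blocks, threads_per_block>>>(d_work_q, d_q_counter,
                                         d_output, queue_max);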

Page 9: Compare and Swap

int atomicCAS(int* address, int compare, int val)

If compare equals the old value stored at address, then val is stored instead. In either case, the routine returns the value of old. It seems a bizarre routine at first sight, but it can be very useful for atomic locks.

Page 10: Compare and Swap (cont.)

Most general type of atomic: you can emulate all of the others with CAS (a sketch follows below).

// What atomicCAS does, expressed as ordinary code;
// the hardware performs all of this atomically.
int atomicCAS(int* address, int compare, int val)
{
  int old_reg_val = *address;
  if (old_reg_val == compare)
    *address = val;
  return old_reg_val;
}
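A minimal sketch (an assumption, not from the slides) of the emulation claim: atomicAdd rebuilt from an atomicCAS retry loop. The same pattern works for min, max, and the other read-modify-write atomics.

__device__ int my_atomicAdd(int* address, int inc)
{
  int old = *address, assumed;
  do {
    assumed = old;
    // Install assumed + inc only if nobody changed *address
    // since we read it; otherwise retry with the new value.
    old = atomicCAS(address, assumed, assumed + inc);
  } while (old != assumed);
  return old;  // same convention: previous value at address
}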

Page 11: Atomics

Atomics are slower than normal load/store. Most of these are associative operations on signed/unsigned integers: quite fast for data in shared memory, slower for data in device memory.

You can have the whole machine queuing on a single location in memory.

Atomics are unavailable on G80!


Page 12: Example – Global Min/Max (Naïve)

// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow.
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}

Page 13: Example – Global Min/Max (Better)

// Introduce intermediate maximum results, so that
// most threads do not try to update the global max.
__global__ void global_max(int* values, int* max,
                           int* regional_maxes, int num_regions)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  int region = i % num_regions;
  if (atomicMax(&regional_maxes[region], val) < val)
    atomicMax(max, val);
}

Page 14: Global Max/Min

A single value causes a serial bottleneck. Create a hierarchy of values for more parallelism; one common arrangement is sketched below. Performance will still be slow, so use atomics judiciously.
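A minimal sketch of such a hierarchy (an assumption, not the slides' code): each block reduces into a shared-memory maximum, and only one thread per block touches the single global location. Requires hardware with shared-memory atomics.

#include <limits.h>

__global__ void hierarchical_max(int* values, int* gl_max)
{
  __shared__ int block_max;
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (threadIdx.x == 0) block_max = INT_MIN;
  __syncthreads();
  atomicMax(&block_max, values[i]);  // fast: shared memory
  __syncthreads();
  if (threadIdx.x == 0)
    atomicMax(gl_max, block_max);    // one global atomic per block
}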

Page 15: Atomics – Summary

Can’t use normal load/store for inter-thread communication because of race conditions

Use atomic instructions for sparse and/or unpredictable global communication

Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism


Page 16: Topic 2 – Streaming Multiprocessor Execution and Divergence

How a streaming multiprocessor (SM) executes threads: an overview of how an SM works, SIMT execution, and divergence.


Page 17: Scheduling Blocks onto SMs

The hardware schedules thread blocks onto available SMs. There is no guarantee of ordering among thread blocks; the hardware will schedule a thread block as soon as a previous thread block finishes.

[Figure: thread blocks 5, 27, 61, and 2001 being scheduled onto a streaming multiprocessor.]

Page 18: Recall – Warps

A warp = 32 threads launched together; they usually execute together as well.


[Figure: rows of control units paired with ALUs; one control unit driving many ALUs is what makes the threads of a warp execute in lockstep.]

Page 19: Mapping of Thread Blocks

Each thread block is mapped to one or more warps

The hardware schedules each warp independently


[Figure: Thread Block N (128 threads) mapped to four warps, TB N W1 through TB N W4.]

Page 20: Thread Scheduling Example

The SM implements zero-overhead warp scheduling. At any time, only one of the warps is executed by the SM. Warps whose next instruction has its inputs ready for consumption are eligible for execution, and eligible warps are selected on a prioritized scheduling policy. All threads in a warp execute the same instruction when selected.


[Figure: scheduling timeline (TB = Thread Block, W = Warp). TB1 W1 issues instructions until it stalls, the scheduler switches to TB2 W1, then TB3 W1, and each warp resumes once its operands are ready.]

Page 21: Control Flow Divergence

Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time

What happens if you have the following code?


if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}

Page 22: Control Flow Divergence (cont.)

This is called warp divergence – CUDA will generate correct code to handle it, but to understand the performance implications you need to understand what CUDA does with it.


Page 23: Control Flow Divergence (cont.)


[Figure from Fung et al., MICRO ’07: at a branch, the warp executes Path A and then Path B serially, masking off the threads on the untaken path.]

Page 24: Control Flow Divergence (cont.)

Nested branches are handled as well


if (foo(threadIdx.x))
{
  if (bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();

Page 25: Control Flow Divergence (cont.)


[Figure: nested divergence – the warp serializes over Path A, Path B, and Path C as it works through the nested branches.]

Page 26: Control Flow Divergence (cont.)

You don’t have to worry about divergence for correctness. (Mostly true, except for corner cases such as intra-warp locks.) You might have to think about it for performance, though; that depends on your branch conditions.


Page 27: Control Flow Divergence (cont.)

One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true. In the previous example, all threads compute the logical predicate and then two predicated instructions:


p = (foo(threadIdx.x));
 p: do_A();
!p: do_B();

Page 28: Control Flow Divergence (cont.)

Performance drops off with the degree of divergence


[Figure: measured performance versus degree of divergence for the switch statement below; performance falls steadily as the number of divergent cases grows.]

switch (threadIdx.x % N)
{
  case 0: ...
  case 1: ...
}

Page 29: Control Flow Divergence (cont.)

Performance drops off with the degree of divergence. In the worst case you effectively lose a factor of 32 in performance, if one thread needs an expensive branch while the rest do nothing.

Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing. GPU implementation: first process the list to build two sub-lists of “simple” and “expensive” elements, then process the two sub-lists separately. A sketch of the first pass follows.
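A minimal sketch of that first pass (an assumption, not from the slides; is_expensive is a hypothetical classifier): atomic counters build the two sub-lists, which later kernels can then process without divergence.

__global__ void split(const int* items, int n,
                      int* simple, int* n_simple,
                      int* expensive, int* n_expensive)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i >= n) return;
  int item = items[i];
  if (is_expensive(item))                        // hypothetical test
    expensive[atomicAdd(n_expensive, 1)] = item;
  else
    simple[atomicAdd(n_simple, 1)] = item;
}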

Page 30: Synchronization

We have already introduced __syncthreads(), which forms a barrier – all threads wait until every one has reached this point.

When writing conditional code, you must be careful to make sure that all threads do reach the __syncthreads(); otherwise you can end up in deadlock, as the sketch below illustrates.

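A minimal sketch of the pitfall (an assumption, not the slides' code), reusing the do_A/do_B pattern from the divergence examples:

// WRONG: only some threads reach the barrier, so the block
// can deadlock waiting for threads that never arrive.
if (threadIdx.x < 16)
{
  do_A();
  __syncthreads();
}
else
{
  do_B();
}

// RIGHT: every thread in the block reaches the same barrier.
if (threadIdx.x < 16) do_A();
else                  do_B();
__syncthreads();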

Page 31: Synchronization (cont.)

Fermi supports some new synchronization instructions which are similar to __syncthreads() but have extra capabilities (usage sketched below):

int __syncthreads_count(predicate)
  counts how many predicates are true
int __syncthreads_and(predicate)
  returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
  returns non-zero (true) if any predicate is true

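A minimal usage sketch (an assumption, not from the slides): a block counts how many of its threads found a target value with one combined barrier-plus-reduction, then one thread folds the block total into a global counter. Requires Fermi (compute capability 2.0) or later.

__global__ void count_matches(const int* data, int target, int* result)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int found = (data[i] == target);               // per-thread predicate
  int block_total = __syncthreads_count(found);  // barrier + count
  if (threadIdx.x == 0)
    atomicAdd(result, block_total);              // one atomic per block
}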

Page 32: Warp voting

There are similar warp voting instructions which operate at the level of a warp (usage sketched below):

int __all(predicate)
  returns non-zero (true) if all predicates in the warp are true
int __any(predicate)
  returns non-zero (true) if any predicate in the warp is true
unsigned int __ballot(predicate)
  sets the nth bit based on the nth predicate

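A minimal sketch (an assumption, not from the slides): __ballot collects one bit per lane, and the first voting lane issues a single atomic on behalf of the whole warp. __popc and __ffs are the standard CUDA population-count and find-first-set intrinsics.

__global__ void warp_count(const int* data, int threshold, int* total)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int votes = __ballot(data[i] > threshold); // one bit per lane
  int lane = threadIdx.x & 31;                        // lane within warp
  if (votes != 0 && lane == __ffs(votes) - 1)
    atomicAdd(total, __popc(votes)); // first voter reports warp count
}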

Page 33: Topic 3 – Locks

Use locks very judiciously. Always include a max_iter in your spinloop! Decompose your data and your locks.


Page 34: Example – Global atomic lock

Problem: when a thread writes data to device memory, the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked.

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  lock = 0;
}

Page 35: Example – Global atomic lock (cont.)

// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  __threadfence(); // wait for writes to finish
  lock = 0;
}

Page 36: Example – Global atomic lock (cont.)

__threadfence_block()
  waits until all global and shared memory writes are visible to all threads in the block

__threadfence()
  waits until all global and shared memory writes are visible to all threads in the block and, for global data, to all threads on the device

One way to make such a lock safer in practice is sketched below.

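A minimal sketch (an assumption, not the slides' code) of one way to sidestep the intra-warp lock corner case mentioned earlier: only thread 0 of each block contends for the global lock, so the threads of a warp never spin against each other.

__device__ int lock = 0; // 0 unlocked, 1 locked

__global__ void leader_lock_kernel(int* shared_resource)
{
  if (threadIdx.x == 0)
  {
    do {} while (atomicCAS(&lock, 0, 1)); // only the leader spins
    *shared_resource += 1;                // critical section
    __threadfence();                      // publish writes first
    atomicExch(&lock, 0);                 // free the lock
  }
  __syncthreads();                        // block waits for its leader
}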

Page 37: Summary

CUDA has lots of esoteric capabilities – don’t worry about most of them. It is essential to understand warp divergence, which can have a very big impact on performance. __syncthreads() is vital; the rest can be ignored until you have a critical need – then read the documentation carefully and look for examples in the SDK.


Page 38: End Credits

Based on original material from Oxford University (Mike Giles) and Stanford University (Jared Hoberock, David Tarjan).

Revision history: last updated 8/8/2011.
