
Accelerated Computing

GPU Teaching Kit

Histogramming
Module 7.1 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn the parallel histogram computation pattern
  – An important, useful computation
  – Very different from all the patterns we have covered so far in terms of the output behavior of each thread
  – A good starting point for understanding output interference in parallel computation

3

Histogram
– A method for extracting notable features and patterns from large data sets
  – Feature extraction for object recognition in images
  – Fraud detection in credit card transactions
  – Correlating heavenly object movements in astrophysics
  – …
– Basic histograms: for each element in the data set, use the value to identify a "bin counter" to increment
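As an illustrative sketch of this basic pattern (not from the slides; the function name is an assumption), a sequential byte-value histogram looks like this:

// The element's value directly selects which bin counter to increment
// (here, one bin per possible byte value).
void sequential_histogram(const unsigned char *data, long size, unsigned int bins[256])
{
    for (long i = 0; i < size; ++i)
        bins[data[i]]++;
}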

3

4

A Text Histogram Example
– Define the bins as four-letter sections of the alphabet: a-d, e-h, i-l, m-p, …
– For each character in an input string, increment the appropriate bin counter
– In the phrase "Programming Massively Parallel Processors", the output histogram is shown below

4

[Chart: output histogram for "Programming Massively Parallel Processors"]
a-d: 5   e-h: 5   i-l: 6   m-p: 10   q-t: 10   u-x: 1   y-z: 1

A simple parallel histogram algorithm

– Partition the input into sections
– Have each thread take a section of the input
– Each thread iterates through its section
– For each letter, increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

– Sectioned partitioning results in poor memory access efficiency
  – Adjacent threads do not access adjacent memory locations
  – Accesses are not coalesced
  – DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

– Sectioned partitioning results in poor memory access efficiency
  – Adjacent threads do not access adjacent memory locations
  – Accesses are not coalesced
  – DRAM bandwidth is poorly utilized

– Change to interleaved partitioning
  – All threads process a contiguous section of elements
  – They all move to the next section and repeat
  – The memory accesses are coalesced

Sectioned:   1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

Interleaved: 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
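As an illustrative sketch (not from the slides), the two partitionings correspond to the following per-thread index patterns; the kernel names and the parameter section_size are assumptions:

// Sectioned partitioning: thread t walks one contiguous chunk.
// Adjacent threads touch addresses that are section_size apart, so accesses do not coalesce.
__global__ void sectioned(const unsigned char *in, long size, long section_size)
{
    long t = blockIdx.x * (long)blockDim.x + threadIdx.x;
    for (long k = 0; k < section_size; ++k) {
        long i = t * section_size + k;
        if (i < size) { /* process in[i] */ }
    }
}

// Interleaved (strided) partitioning: in each iteration adjacent threads
// read adjacent elements, so the accesses coalesce into a few DRAM bursts.
__global__ void interleaved(const unsigned char *in, long size)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;   // total number of threads
    for (; i < size; i += stride) { /* process in[i] */ }
}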

10

Interleaved Partitioning of Input
– For coalescing and better memory access performance

…

11

Interleaved Partitioning (Iteration 2)

…

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Introduction to Data Races
Module 7.2 – Parallel Computation Patterns (Histogram)

2

Objective
– To understand data races in parallel computing
  – Data races can occur when performing read-modify-write operations
  – Data races can cause errors that are hard to reproduce
  – Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

– For coalescing and better memory access performance

…

4

Read-Modify-Write Used in Collaboration Patterns

– For example, multiple bank tellers count the total amount of cash in the safe
  – Each grabs a pile and counts
  – Have a central display of the running total
  – Whenever someone finishes counting a pile, read the current running total (read) and add the subtotal of the pile to the running total (modify-write)

– A bad outcome
  – Some of the piles were not accounted for in the final total

5

A Common Parallel Service Pattern
– For example, multiple customer service agents serving waiting customers
– The system maintains two numbers
  – the number to be given to the next incoming customer (I)
  – the number for the customer to be served next (S)
– The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)
– A central display shows the number for the customer to be served next
– When an agent becomes available, he/she calls the number (read S) and increments the display number by 1 (modify-write S)
– Bad outcomes
  – Multiple customers receive the same number; only one of them receives service
  – Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each
  – Brings up a flight seat map (read)
  – Decides on a seat
  – Updates the seat map and marks the selected seat as taken (modify-write)

– A bad outcome
  – Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:              thread2:
  Old ← Mem[x]          Old ← Mem[x]
  New ← Old + 1         New ← Old + 1
  Mem[x] ← New          Mem[x] ← New
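For illustration (not slide code), the same race appears when many CUDA threads increment one counter without an atomic operation; the kernel names are assumptions:

// (*counter)++ compiles into separate load / add / store instructions, so two
// threads can read the same old value and one of the updates gets lost.
__global__ void racy_increment(unsigned int *counter)   { (*counter)++; }

// atomicAdd performs the whole read-modify-write as one indivisible operation.
__global__ void atomic_increment(unsigned int *counter) { atomicAdd(counter, 1); }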

8

Timing Scenario 1

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                            (1) Old ← Mem[x]
5                            (2) New ← Old + 1
6                            (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3                            (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                            (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                            (1) New ← Old + 1
6                            (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                            (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

With an atomic operation, the two read-modify-write sequences are serialized in one of two orders:

thread1:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread2:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Or

thread2:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread1:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0

time →
thread1:  Old ← Mem[x]      New ← Old + 1      Mem[x] ← New
thread2:        Old ← Mem[x]       New ← Old + 1       Mem[x] ← New

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
  int atomicAdd(int* address, int val);
  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
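Since CAS is listed among the intrinsics, here is a brief illustrative sketch of how an arbitrary read-modify-write can be built from atomicCAS; the helper atomicIncSat and its saturating behavior are hypothetical, not part of CUDA:

// Hypothetical helper: atomically increment *addr but never exceed `limit`.
// The atomicCAS retry loop is the standard way to build a custom atomic RMW.
__device__ unsigned int atomicIncSat(unsigned int *addr, unsigned int limit)
{
    unsigned int old = *addr, assumed;
    do {
        assumed = old;
        unsigned int desired = (assumed < limit) ? assumed + 1 : assumed;
        old = atomicCAS(addr, assumed, desired);   // returns the value it found
    } while (old != assumed);                      // another thread raced us: retry
    return old;
}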

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 and above)
  float atomicAdd(float* address, float val);
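As a minimal usage sketch (illustrative; the kernel name and variables are assumptions), the float variant can accumulate a global sum:

// Every thread folds one element into a single accumulator.
// Correct (no lost updates), but all updates to *sum are serialized.
__global__ void sum_all(const float *data, long n, float *sum)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, data[i]);   // float atomicAdd, compute capability >= 2.0
}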

6

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
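A minimal host-side launch sketch for this kernel (illustrative only; the wrapper name, grid size, and omission of error checking are assumptions):

#include <cuda_runtime.h>

// Hypothetical wrapper: copies the text to the device, runs histo_kernel with a
// modest grid (the kernel's stride loop covers any input size), and copies back
// the 256 bin counters.
void run_histogram(const unsigned char *h_buffer, long size, unsigned int h_histo[256])
{
    unsigned char *d_buffer;  unsigned int *d_histo;
    cudaMalloc((void**)&d_buffer, size);
    cudaMalloc((void**)&d_histo, 256 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));

    histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);  cudaFree(d_histo);
}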

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

time →
| atomic operation N                      | atomic operation N+1                    |
| DRAM read latency | DRAM write latency  | DRAM read latency | DRAM write latency  |

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
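A rough back-of-the-envelope illustration (the 1 GHz interface clock and 1000-cycle latency are assumptions, not figures from the slides):

$$
\text{atomic throughput on one location} \approx \frac{f_{\text{mem}}}{L_{\text{rmw}}} = \frac{10^{9}\ \text{cycles/s}}{1000\ \text{cycles/op}} = 10^{6}\ \text{ops/s}
$$

At a few bytes per update this amounts to only a few MB/s of useful traffic on that location, compared with tens of GB/s of peak bandwidth for coalesced streaming accesses.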

5

6

You may have a similar experience in supermarket checkout

– Some customers realize that they missed an item after they have started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

time →
| atomic operation N       | atomic operation N+1     |
| L2 latency | L2 latency  | L2 latency | L2 latency  |

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

time →
| atomic operation N | atomic operation N+1 |

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Diagram: without privatization, Block 0, Block 1, …, Block N all perform atomic updates directly on one final copy – heavy contention and serialization.]

4

Privatization (cont.)

[Diagram: with privatization, each block performs atomic updates on its own private copy (Copy 0, Copy 1, …, Copy N) – much less contention and serialization.]

5

Privatization (cont.)

[Diagram: the private copies (Copy 0, Copy 1, …, Copy N) are then combined into the final copy – much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
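Putting the three fragments together, a minimal assembled sketch of the privatized kernel (illustrative, not verbatim slide code; the fragment's buffer[i]/4 binning is replaced with the bounded a–z binning from Module 7.3, which is an assumption about the intended bins):

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory (7 bins: a-d, e-h, ..., y-z)
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with cheap shared-memory atomics
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for the whole block, then fold the private copy into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}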

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications

– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement

– The private histogram size needs to be small
  – It must fit into shared memory

– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 2: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location address
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add

    int atomicAdd(int* address, int val);

  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
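As an illustration of the CAS intrinsic listed above, here is the well-known compare-and-swap loop for building an atomic operation the hardware does not provide directly (a sketch adapted from the pattern shown in the CUDA C Programming Guide; double-precision atomicAdd is built in on compute capability 6.0 and later, so this is only needed on older devices):

    __device__ double atomicAddDouble(double *address, double val) {
        unsigned long long int *address_as_ull = (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // Swap in the new bit pattern only if the location still holds
            // the value this computation was based on.
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);  // retry if another thread intervened
        return __longlong_as_double(old);
    }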

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add

    unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add

    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

– Single-precision floating-point atomic add (compute capability 2.0 and above)

    float atomicAdd(float* address, float val);

6
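A small usage sketch for the float overload (illustrative; the kernel and variable names are assumptions): every thread folds one input element into a single running total.

    __global__ void sum_kernel(const float *in, long n, float *total) {
        long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
        if (i < n)
            atomicAdd(total, in[i]);   // float atomicAdd, compute capability >= 2.0
    }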

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        // stride is total number of threads
        int stride = blockDim.x * gridDim.x;

        // All threads handle blockDim.x * gridDim.x consecutive elements
        while (i < size) {
            atomicAdd(&(histo[buffer[i]]), 1);
            i += stride;
        }
    }

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        // stride is total number of threads
        int stride = blockDim.x * gridDim.x;

        // All threads handle blockDim.x * gridDim.x consecutive elements
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo[alphabet_position / 4]), 1);
            i += stride;
        }
    }
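A host-side launch sketch for the refined kernel above (not part of the slides; the grid/block sizes and names such as d_buffer and d_histo are assumptions): it allocates device buffers, zeroes the 7 bin counters, runs the kernel, and copies the histogram back.

    #include <cuda_runtime.h>
    #include <stdio.h>

    void run_histogram(const unsigned char *h_buffer, long size) {
        unsigned char *d_buffer;
        unsigned int *d_histo, h_histo[7];

        cudaMalloc(&d_buffer, size);
        cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
        cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
        cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

        // Any grid size works because the kernel uses a strided loop.
        histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

        cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        for (int b = 0; b < 7; b++) printf("bin %d: %u\n", b, h_histo[b]);
        cudaFree(d_buffer); cudaFree(d_histo);
    }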

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N occupies a DRAM read latency followed by a DRAM write latency; atomic operation N+1 can only begin after operation N's write completes]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel

5
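A worked illustration (numbers assumed for clarity, not taken from the slides): if one contended read-modify-write takes about 1000 cycles on a device clocked near 1 GHz, the location sustains at most

    1 op / 1000 cycles × 10^9 cycles/s = 10^6 atomic operations per second,

i.e. roughly 4 MB/s of 4-byte updates, while a single DRAM channel can stream data on the order of tens of GB/s.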

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of the checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline: atomic operations N and N+1 each take only an L2 read latency plus an L2 write latency, completing much faster than the DRAM case]

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

[Timeline: atomic operations N and N+1 on shared memory complete in a small fraction of the time of the L2 or DRAM cases]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Figure, left: Block 0 … Block N all make atomic updates directly to the single final copy of the histogram – heavy contention and serialization. Right: each block makes atomic updates to its own private copy (Copy 0 … Copy N), which are later merged into the final copy]

4

Privatization (cont.)

[Figure: same diagram, now highlighting the privatized side – each block updates its own private copy, so there is much less contention and serialization]

5

Privatization (cont.)

[Figure: the private copies (Copy 0 … Copy N) are combined into the final copy in a second step – much less contention and serialization]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        __shared__ unsigned int histo_private[7];

        if (threadIdx.x < 7)
            histo_private[threadIdx.x] = 0;
        __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);

11
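Putting the three fragments together, a complete privatized kernel might look like the following sketch (assembled here for readability; the kernel name is an assumption, and the slides present the body in pieces):

    __global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                            unsigned int *histo)
    {
        // Per-block private histogram in shared memory (7 four-letter bins).
        __shared__ unsigned int histo_private[7];
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Interleaved (strided) traversal of the input for coalesced accesses.
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo_private[alphabet_position / 4]), 1);
            i += stride;
        }

        // Merge this block's private bins into the global histogram.
        __syncthreads();
        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }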

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory, as sketched below

12
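A hedged sketch of partial privatization with range testing (illustrative only; the bin layout, the PRIV_BINS threshold, and all names are assumptions, not from the slides): bins whose index falls below the threshold are counted in shared memory, while the remaining bins go straight to global memory.

    #define PRIV_BINS 64   // assumed number of bins kept in shared memory

    __global__ void partial_priv_histo(unsigned char *buffer, long size,
                                       unsigned int *histo)
    {
        __shared__ unsigned int histo_s[PRIV_BINS];
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            histo_s[b] = 0;
        __syncthreads();

        for (long i = threadIdx.x + (long)blockIdx.x * blockDim.x; i < size;
             i += (long)blockDim.x * gridDim.x) {
            int bin = buffer[i];                     // one bin per byte value (0..255 assumed)
            if (bin < PRIV_BINS)
                atomicAdd(&histo_s[bin], 1);         // shared memory atomic (fast path)
            else
                atomicAdd(&histo[bin], 1);           // global memory atomic (slow path)
        }

        __syncthreads();
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            if (histo_s[b] > 0)
                atomicAdd(&histo[b], histo_s[b]);    // merge the privatized bins
    }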

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 3: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 4: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data Races
Module 7.2 – Parallel Computation Patterns (Histogram)

2

Objective
– To understand data races in parallel computing
  – Data races can occur when performing read-modify-write operations
  – Data races can cause errors that are hard to reproduce
  – Atomic operations are designed to eliminate such data races

3

Read-Modify-Write in the Text Histogram Example

– For coalescing and better memory access performance
[Figure: interleaved partitioning of the input text; each bin-counter increment is a read-modify-write on the shared histogram]

4

Read-Modify-Write Used in Collaboration Patterns

– For example, multiple bank tellers count the total amount of cash in the safe
  – Each grabs a pile and counts
  – Have a central display of the running total
  – Whenever someone finishes counting a pile, read the current running total (read) and add the subtotal of the pile to the running total (modify-write)
– A bad outcome
  – Some of the piles were not accounted for in the final total

5

A Common Parallel Service Pattern
– For example, multiple customer service agents serving waiting customers
– The system maintains two numbers
  – the number to be given to the next incoming customer (I)
  – the number for the customer to be served next (S)
– The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)
– A central display shows the number for the customer to be served next
– When an agent becomes available, he/she calls the number (read S) and increments the display number by 1 (modify-write S)
– Bad outcomes
  – Multiple customers receive the same number; only one of them receives service
  – Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each
  – Brings up a flight seat map (read)
  – Decides on a seat
  – Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
  – Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:                     thread2:
  Old <- Mem[x]                Old <- Mem[x]
  New <- Old + 1               New <- Old + 1
  Mem[x] <- New                Mem[x] <- New
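The same three steps can be written as a deliberately unsafe CUDA kernel. This is an illustrative sketch (the kernel name and the launch in the comment are not from the slides):

__global__ void racy_increment(unsigned int *counter)
{
  // Non-atomic read-modify-write: each thread runs the same three
  // steps as thread1/thread2 above, so updates can be lost.
  unsigned int Old = *counter;     // Old <- Mem[x]
  unsigned int New = Old + 1;      // New <- Old + 1
  *counter = New;                  // Mem[x] <- New
}

// With, e.g., racy_increment<<<1, 256>>>(d_counter) and *d_counter
// initialized to 0, the final value is usually far less than 256.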

8

Timing Scenario 1

Time | Thread 1            | Thread 2
  1  | (0) Old <- Mem[x]   |
  2  | (1) New <- Old + 1  |
  3  | (1) Mem[x] <- New   |
  4  |                     | (1) Old <- Mem[x]
  5  |                     | (2) New <- Old + 1
  6  |                     | (2) Mem[x] <- New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time | Thread 1            | Thread 2
  1  |                     | (0) Old <- Mem[x]
  2  |                     | (1) New <- Old + 1
  3  |                     | (1) Mem[x] <- New
  4  | (1) Old <- Mem[x]   |
  5  | (2) New <- Old + 1  |
  6  | (2) Mem[x] <- New   |

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time | Thread 1            | Thread 2
  1  | (0) Old <- Mem[x]   |
  2  | (1) New <- Old + 1  |
  3  |                     | (0) Old <- Mem[x]
  4  | (1) Mem[x] <- New   |
  5  |                     | (1) New <- Old + 1
  6  |                     | (1) Mem[x] <- New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time | Thread 1            | Thread 2
  1  |                     | (0) Old <- Mem[x]
  2  |                     | (1) New <- Old + 1
  3  | (0) Old <- Mem[x]   |
  4  |                     | (1) Mem[x] <- New
  5  | (1) New <- Old + 1  |
  6  | (1) Mem[x] <- New   |

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

thread1:                     thread2:
  Old <- Mem[x]
  New <- Old + 1
  Mem[x] <- New
                               Old <- Mem[x]
                               New <- Old + 1
                               Mem[x] <- New

Or

thread1:                     thread2:
                               Old <- Mem[x]
                               New <- Old + 1
                               Mem[x] <- New
  Old <- Mem[x]
  New <- Old + 1
  Mem[x] <- New
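As a preview of the next module, here is a hedged sketch of how an atomic operation forces one of the two good orderings shown above; safe_increment is an illustrative name, not code from the slides:

__global__ void safe_increment(unsigned int *counter)
{
  // atomicAdd performs the read-modify-write as one indivisible operation,
  // so the two threads' sequences can only interleave in the two
  // "good" orders shown above.
  unsigned int Old = atomicAdd(counter, 1);  // returns the value before the add
  (void)Old;  // every thread observes a distinct Old value
}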

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

Mem[x] initialized to 0

time
 |   thread1:                thread2:
 |     Old <- Mem[x]
 |                             Old <- Mem[x]
 |     New <- Old + 1
 |                             New <- Old + 1
 |     Mem[x] <- New
 v                             Mem[x] <- New

– Both threads receive 0 in Old
– Mem[x] becomes 1

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
  int atomicAdd(int* address, int val);
  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 and above)
  float atomicAdd(float* address, float val);

6
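A small usage sketch of the single-precision overload (the kernel name and the per-thread partial-sum strategy are illustrative assumptions, not from the slides):

__global__ void sum_kernel(const float *x, long n, float *result)
{
  // Each thread accumulates a private partial sum, then folds it into
  // the single global result with one float atomicAdd (CC >= 2.0).
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  float partial = 0.0f;
  for (long k = i; k < n; k += stride)
    partial += x[k];
  atomicAdd(result, partial);
}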

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride is total number of threads
  int stride = blockDim.x * gridDim.x;

  // All threads handle blockDim.x * gridDim.x consecutive elements
  while (i < size) {
    atomicAdd(&(histo[buffer[i]]), 1);
    i += stride;
  }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride is total number of threads
  int stride = blockDim.x * gridDim.x;

  // All threads handle blockDim.x * gridDim.x consecutive elements
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo[alphabet_position/4]), 1);
    i += stride;
  }
}
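A hypothetical host-side driver for the kernel above might look like the sketch below; the lowercase test phrase (the kernel as written only classifies lowercase letters), the 7-bin output, and the launch configuration are assumptions for illustration.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
  const char *text = "programming massively parallel processors";
  long size = (long)strlen(text);

  unsigned char *d_buffer; unsigned int *d_histo;
  cudaMalloc((void**)&d_buffer, size);
  cudaMalloc((void**)&d_histo, 7 * sizeof(unsigned int));
  cudaMemcpy(d_buffer, text, size, cudaMemcpyHostToDevice);
  cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

  histo_kernel<<<8, 128>>>(d_buffer, size, d_histo);

  unsigned int histo[7];
  cudaMemcpy(histo, d_histo, sizeof(histo), cudaMemcpyDeviceToHost);
  for (int b = 0; b < 7; ++b)
    printf("bin %d: %u\n", b, histo[b]);   // bins a-d, e-h, ..., y-z

  cudaFree(d_buffer); cudaFree(d_histo);
  return 0;
}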

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

time:  | atomic operation N: DRAM read latency + DRAM write latency | atomic operation N+1: DRAM read latency + DRAM write latency |

5

Latency Determines Throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel

5
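To put a rough number on this, assuming purely for illustration a 1 GHz memory clock and a ~1000-cycle read-modify-write latency on a contended DRAM location:

  throughput per contended location ≈ (1 atomic / 1000 cycles) × 10^9 cycles/s = 10^6 atomic operations per second

This is a tiny fraction of the billions of accesses per second a DRAM channel can otherwise sustain, which is the sense in which latency, not bandwidth, limits atomic throughput on a single location.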

6

You may have a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

time:  | atomic operation N: L2 read latency + L2 write latency | atomic operation N+1: L2 read latency + L2 write latency |

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by the programmer (more later)

time:  | atomic operation N | atomic operation N+1 |   (back to back, very short latency)

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 5: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 6: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objective – To learn about the main performance considerations of atomic operations
– Latency and throughput of atomic operations
– Atomic operations on global memory
– Atomic operations on the shared L2 cache
– Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized
(Timeline figure: atomic operation N, then atomic operation N+1; each occupies a full DRAM read latency followed by a full DRAM write latency.)

5

Latency Determines Throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel
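A rough, illustrative calculation (the clock rate is an assumption, not from the slides): with a 1 GHz clock and an end-to-end read-modify-write latency of about 1000 cycles, contended atomics on a single location complete at roughly 10^9 / 1000 = 10^6 operations per second. At 4 bytes per update that is on the order of 4 MB/s, several orders of magnitude below the tens of GB/s a DRAM channel delivers for coalesced accesses.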

5

6

You may have had a similar experience at a supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout would be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics
(Timeline figure: atomic operation N, then atomic operation N+1; each now takes an L2 read latency followed by an L2 write latency.)

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)
(Timeline figure: atomic operation N followed almost immediately by atomic operation N+1.)

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective – Learn to write a high performance kernel by privatizing outputs
– Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
– A high performance privatized histogram kernel
– Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization
(Figure: in the original scheme, Blocks 0 … N all perform atomic updates directly on a single Final Copy – heavy contention and serialization. In the privatized scheme, each Block 0 … N updates its own private Copy 0 … N, and the private copies are later merged into the Final Copy.)

4

Privatization (cont.)
(Same figure: with the private per-block copies, the atomic updates see much less contention and serialization.)


6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];
    …

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();
    …

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}

(The fragments above are assembled into one complete kernel below.)
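A complete version assembling the slide fragments into a single kernel, for reference; this is a sketch that keeps the slides' 7-bin, value/4 binning, so it assumes buffer[] already holds small bin-friendly values (e.g., alphabet positions as produced in the Module 7.3 kernel) and that each block has at least 7 threads.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the private bin counters (assumes blockDim.x >= 7)
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the per-block histogram with cheap shared memory atomics
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}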

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows below)
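Not from the slides: a hypothetical sketch of partial privatization. The names (histo_partial_kernel, bins_in, PRIV_BINS) and the choice of privatizing only the first PRIV_BINS bins are assumptions; the point is the range test that routes each update to the shared memory copy when possible and to global memory otherwise.

#define PRIV_BINS 1024   // assumed to fit comfortably in shared memory (4 KB)

__global__ void histo_partial_kernel(unsigned int *bins_in, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = bins_in[i];        // assumed to be a valid bin index
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);      // privatized range: shared memory atomic
        else
            atomicAdd(&histo[bin], 1);        // remaining range: global memory atomic
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);     // merge the privatized part
}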

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.


(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 9: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization


6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  // stride is total number of threads
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    // buffer[i] is assumed here to already hold an alphabet position (0-25)
    atomicAdd(&(histo_private[buffer[i]/4]), 1);
    i += stride;
  }

10

11

Build Final Histogram

  // wait for all other threads in the block to finish
  __syncthreads();

  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
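Putting the fragments from the last four slides together, here is a minimal end-to-end sketch of the privatized kernel (the kernel name is illustrative; the 7 bins of 4 letters match the text histogram example, and the a–z range check is carried over from the basic kernel of Module 7.3):

__global__ void histo_privatized(unsigned char *buffer, long size,
                                 unsigned int *histo)
{
  // Private copy of the 7-bin histogram, one per thread block
  __shared__ unsigned int histo_private[7];
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

  // Interleaved (coalesced) traversal of the input
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;   // total number of threads
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position / 4]), 1);  // shared-memory atomic
    i += stride;
  }

  // Wait for the whole block, then merge the private copy into the global histogram
  __syncthreads();
  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

A launch such as histo_privatized<<<num_blocks, 256>>>(d_buffer, size, d_histo), with histo zeroed on the device beforehand, would exercise it; the block and grid sizes here are illustrative.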

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows this slide)

12
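As a sketch of that last point (hypothetical code, not from the slides: PRIV_BINS, the precomputed bin indices in bins[], and the kernel name are illustrative assumptions), range testing routes updates for the privatized range to shared memory and everything else to the global copy:

#define PRIV_BINS 256   // how many low-numbered bins fit in shared memory

__global__ void histo_partial_private(unsigned int *bins, long size,
                                      unsigned int *histo, int num_bins)
{
  __shared__ unsigned int histo_s[PRIV_BINS];
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    histo_s[b] = 0;
  __syncthreads();

  long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
  long stride = (long)blockDim.x * gridDim.x;
  while (i < size) {
    unsigned int bin = bins[i];            // bin index assumed precomputed
    if (bin < PRIV_BINS)
      atomicAdd(&histo_s[bin], 1);         // privatized range: shared-memory atomic
    else if (bin < (unsigned int)num_bins)
      atomicAdd(&histo[bin], 1);           // remaining range: global-memory atomic
    i += stride;
  }
  __syncthreads();

  // Merge the privatized range into the global histogram
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    atomicAdd(&histo[b], histo_s[b]);
}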

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License




Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 11: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 or higher)
  float atomicAdd(float* address, float val);

6
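As a usage example (illustrative names, not from the slides), the single-precision overload can accumulate per-thread values into one global sum on devices of compute capability 2.0 or higher. With heavy contention on a single sum this is slow for the reasons discussed in Module 7.4, so a reduction or privatization is normally preferred:

__global__ void accumulate(const float *data, int n, float *sum) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(sum, data[i]);   // float atomicAdd: one atomic read-add-write per element
}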

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
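A possible host-side launch for this kernel is sketched below (the 64-block grid and variable names are assumptions for illustration; h_buffer and size are presumed to hold the input text and its length):

// Host-side sketch: 7 bins for the a-d, e-h, ..., y-z grouping used above
unsigned char *d_buffer;
unsigned int  *d_histo, h_histo[7];
cudaMalloc(&d_buffer, size);
cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

histo_kernel<<<64, 256>>>(d_buffer, size, d_histo);   // any grid size works: the kernel strides over the input

cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);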

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

  atomic operation N:    [DRAM read latency][DRAM write latency]
  atomic operation N+1:                                         [DRAM read latency][DRAM write latency]
                                                                                                 time →

5

Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1,000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
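As a rough back-of-the-envelope illustration (the clock rate and latency are assumed round numbers, not figures from the slides): at a 1 GHz memory clock with about 1,000 cycles per serialized read-modify-write, a single contended location sustains roughly 10^9 / 1,000 = 10^6 atomic operations per second, i.e. about 4 MB/s of 4-byte updates, compared with the tens of GB/s of peak bandwidth a memory channel can deliver for coalesced accesses.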

5

6

You may have had a similar experience in a supermarket checkout line
– Some customers realize that they missed an item after they have started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before fetching any of the items
  – The rate of checkout would be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

  atomic operation N:    [L2 read latency][L2 write latency]
  atomic operation N+1:                                     [L2 read latency][L2 write latency]
                                                                                        time →

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

  atomic operation N → atomic operation N+1    (time →)

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

(Figure: in the non-privatized scheme, Blocks 0, 1, …, N all issue atomic updates directly to the single final copy. In the privatized scheme, each block issues atomic updates to its own private copy, Copy 0, Copy 1, …, Copy N, which are later merged into the final copy. Updating the single final copy directly causes heavy contention and serialization.)

4

Privatization (cont.)

(Same figure: with private copies Copy 0 … Copy N, the atomic updates are spread across the copies, giving much less contention and serialization.)

5

Privatization (cont.)

(Same figure, emphasizing the privatized path: much less contention and serialization.)

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
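Putting the fragments from the last three slides together, one complete privatized kernel might read as follows (a sketch: the kernel name is made up, and the bin computation reuses the alphabet_position range check from Module 7.3 so that non-letter bytes are ignored):

__global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                        unsigned int *histo)
{
    // Per-block private copy of the 7-bin histogram
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved (coalesced) accesses
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for all threads in the block, then merge into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}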

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory, as sketched below
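One way such range testing might look is sketched below (purely illustrative: the kernel, the PRIV_BINS threshold, and the precomputed bin_index input are assumptions, not content from the slides). Only the low-numbered, presumably hot bins are privatized in shared memory; the rest go straight to the global histogram:

#define PRIV_BINS 1024   // assumed number of bins that fit comfortably in shared memory (4 KB here)

__global__ void histo_partial_private(const unsigned int *bin_index, long size,
                                      unsigned int *histo, int num_bins)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = bin_index[i];
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);        // privatized range: shared memory atomic
        else if (bin < (unsigned int)num_bins)
            atomicAdd(&histo[bin], 1);          // remaining range: global memory (L2) atomic
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);       // merge the private bins into the final copy
}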

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 12: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 13: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

A Common Arbitration Pattern
- For example, multiple customers booking airline tickets in parallel
- Each customer
  - Brings up a flight seat map (read)
  - Decides on a seat
  - Updates the seat map, marking the selected seat as taken (modify-write)
- A bad outcome
  - Multiple passengers end up booking the same seat

Data Race in Parallel Thread Execution

Both threads execute the same three-step sequence; Old and New are per-thread register variables:

    Thread 1:              Thread 2:
    Old <- Mem[x]          Old <- Mem[x]
    New <- Old + 1         New <- Old + 1
    Mem[x] <- New          Mem[x] <- New

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?
Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race. (A CUDA sketch of such an unprotected update is shown below.)
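As a concrete illustration (a minimal sketch, not part of the original slides; the kernel name, the single-element counter array, and the launch shape are assumptions), the kernel below performs exactly the unprotected read-modify-write from the figure. Launched with many threads, the final counter value is usually far below the thread count because updates are lost, just as in the timing scenarios that follow.

    // Racy increment: each thread does a separate read, add, and write.
    __global__ void racy_count(unsigned int *counter) {
        unsigned int old = counter[0];   // read
        unsigned int new_val = old + 1;  // modify
        counter[0] = new_val;            // write - may overwrite another thread's update
    }
    // e.g. racy_count<<<1024, 256>>>(d_counter); typically yields a total well below 262144.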

Timing Scenario 1

    Time    Thread 1             Thread 2
    1       (0) Old <- Mem[x]
    2       (1) New <- Old + 1
    3       (1) Mem[x] <- New
    4                            (1) Old <- Mem[x]
    5                            (2) New <- Old + 1
    6                            (2) Mem[x] <- New

- Thread 1: Old = 0
- Thread 2: Old = 1
- Mem[x] = 2 after the sequence

Timing Scenario 2

    Time    Thread 1             Thread 2
    1                            (0) Old <- Mem[x]
    2                            (1) New <- Old + 1
    3                            (1) Mem[x] <- New
    4       (1) Old <- Mem[x]
    5       (2) New <- Old + 1
    6       (2) Mem[x] <- New

- Thread 1: Old = 1
- Thread 2: Old = 0
- Mem[x] = 2 after the sequence

Timing Scenario 3

    Time    Thread 1             Thread 2
    1       (0) Old <- Mem[x]
    2       (1) New <- Old + 1
    3                            (0) Old <- Mem[x]
    4       (1) Mem[x] <- New
    5                            (1) New <- Old + 1
    6                            (1) Mem[x] <- New

- Thread 1: Old = 0
- Thread 2: Old = 0
- Mem[x] = 1 after the sequence

Timing Scenario 4

    Time    Thread 1             Thread 2
    1                            (0) Old <- Mem[x]
    2                            (1) New <- Old + 1
    3       (0) Old <- Mem[x]
    4                            (1) Mem[x] <- New
    5       (1) New <- Old + 1
    6       (1) Mem[x] <- New

- Thread 1: Old = 0
- Thread 2: Old = 0
- Mem[x] = 1 after the sequence

Purpose of Atomic Operations - To Ensure Good Outcomes

Atomic operations force one thread's entire read-modify-write sequence to complete before the other thread's sequence begins. Either

    Thread 1: Old <- Mem[x]; New <- Old + 1; Mem[x] <- New
    Thread 2:                                               Old <- Mem[x]; New <- Old + 1; Mem[x] <- New

or

    Thread 2: Old <- Mem[x]; New <- Old + 1; Mem[x] <- New
    Thread 1:                                               Old <- Mem[x]; New <- Old + 1; Mem[x] <- New


Atomic Operations in CUDA
Module 7.3 - Parallel Computation Patterns (Histogram)

Objective
- To learn to use atomic operations in parallel programming
  - Atomic operation concepts
  - Types of atomic operations in CUDA
  - Intrinsic functions
  - A basic histogram kernel

Data Race Without Atomic Operations

    Mem[x] initialized to 0

    Thread 1: Old <- Mem[x]   New <- Old + 1   Mem[x] <- New
    Thread 2:   Old <- Mem[x]   New <- Old + 1   Mem[x] <- New
              ----------------------- time ----------------------->

- Both threads read Mem[x] before either write completes, so both receive 0 in Old
- Mem[x] becomes 1

Key Concepts of Atomic Operations
- A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  - Read the old value, calculate a new value, and write the new value to the location
- The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  - Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  - All threads perform their atomic operations serially on the same location

Atomic Operations in CUDA
- Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  - Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  - See the CUDA C Programming Guide 4.0 or later for details
- Atomic add:

    int atomicAdd(int* address, int val);

  - Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old. (A short usage sketch follows.)
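A minimal usage sketch (not from the slides; the kernel name and the idea of using the returned value as a "ticket" are illustrative) showing both roles of atomicAdd: updating a shared counter safely, and handing each thread a unique pre-increment value, much like the customer-number example in Module 7.2.

    __global__ void take_tickets(unsigned int *next_number, unsigned int *tickets) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        // Atomically increment the shared counter; 'old' is unique per thread.
        unsigned int old = atomicAdd(next_number, 1u);
        tickets[t] = old;   // every thread gets a distinct number, no duplicates
    }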

More Atomic Adds in CUDA
- Unsigned 32-bit integer atomic add

    unsigned int atomicAdd(unsigned int* address, unsigned int val);

- Unsigned 64-bit integer atomic add

    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

- Single-precision floating-point atomic add (compute capability >= 2.0)

    float atomicAdd(float* address, float val);
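Where a needed atomic is not provided natively, it can be built from atomicCAS. The sketch below is the well-known double-precision add loop described in the CUDA C Programming Guide for devices without a native double atomicAdd (compute capability below 6.0); the function name is ours, and it is shown here as an illustration of the compare-and-swap retry pattern rather than as part of the original slides.

    __device__ double atomicAddDouble(double *address, double val) {
        unsigned long long int *address_as_ull = (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            // Try to swap in old+val only if the location still holds 'assumed'.
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);   // another thread intervened - retry
        return __longlong_as_double(old);
    }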

A Basic Text Histogram Kernel
- The kernel receives a pointer to the input buffer of byte values
- Each thread processes the input in a strided pattern

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        // stride is the total number of threads
        int stride = blockDim.x * gridDim.x;

        // All threads together handle blockDim.x * gridDim.x consecutive elements per iteration
        while (i < size) {
            atomicAdd(&(histo[buffer[i]]), 1);
            i += stride;
        }
    }

A Basic Histogram Kernel (cont.)
- The kernel receives a pointer to the input buffer of byte values
- Each thread processes the input in a strided pattern, mapping each letter to one of seven bins of the alphabet (a host-side launch sketch follows)

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;

        // stride is the total number of threads
        int stride = blockDim.x * gridDim.x;

        // All threads together handle blockDim.x * gridDim.x consecutive elements per iteration
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo[alphabet_position / 4]), 1);
            i += stride;
        }
    }
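For completeness, a minimal host-side sketch (not part of the slides; the function name, variable names, bin count of 7, and launch shape are assumptions) that allocates device memory, launches the letter-histogram version of histo_kernel above, and copies the bin counts back:

    #include <cuda_runtime.h>
    #include <stdio.h>

    void run_histogram(const unsigned char *h_buffer, long size)
    {
        unsigned char *d_buffer;
        unsigned int  *d_histo, h_histo[7];

        cudaMalloc((void **)&d_buffer, size);
        cudaMalloc((void **)&d_histo, 7 * sizeof(unsigned int));
        cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
        cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

        // Any grid size works because each thread strides over the whole input.
        histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

        cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        for (int b = 0; b < 7; b++)
            printf("bin %d: %u\n", b, h_histo[b]);

        cudaFree(d_buffer);
        cudaFree(d_histo);
    }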


Atomic Operation Performance
Module 7.4 - Parallel Computation Patterns (Histogram)

Objective
- To learn about the main performance considerations of atomic operations
  - Latency and throughput of atomic operations
  - Atomic operations on global memory
  - Atomic operations on the shared L2 cache
  - Atomic operations on shared memory

Atomic Operations on Global Memory (DRAM)
- An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
- The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
- During this whole time, no other thread can access the location

Atomic Operations on DRAM
- Each read-modify-write has two full memory access delays
- All atomic operations on the same variable (DRAM location) are serialized

    Timeline: atomic operation N = [DRAM read latency][DRAM write latency], followed by atomic operation N+1 = [DRAM read latency][DRAM write latency]

Latency determines throughput
- Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
- The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
- This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel (a worked example follows)
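A rough worked example (the numbers are illustrative assumptions, not from the slides): suppose the full read-modify-write on one DRAM location takes about 1000 cycles on a device clocked at 1 GHz. Serialized atomics on that location then complete at roughly

    1 operation / 1000 cycles  x  10^9 cycles/s  =  10^6 atomic operations per second,

and since each operation updates a 4-byte counter, that is only about 4 MB/s of useful update traffic, versus the several gigabytes per second a single memory channel can deliver for coalesced accesses - consistent with the "less than 1/1000 of peak bandwidth" figure above.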

You may have had a similar experience in a supermarket checkout
- Some customers realize that they missed an item after they started to check out
  - They run to the aisle and get the item while the line waits
  - The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
- Imagine a store where every customer starts the checkout before they even fetch any of the items
  - The rate of checkout will be 1 / (entire shopping time of each customer)

Hardware Improvements
- Atomic operations on the Fermi L2 cache
  - Medium latency, about 1/10 of the DRAM latency
  - Shared among all blocks
  - A "free improvement" on global memory atomics

    Timeline: atomic operation N = [L2 latency][L2 latency], followed by atomic operation N+1 = [L2 latency][L2 latency]

Hardware Improvements
- Atomic operations on shared memory
  - Very short latency
  - Private to each thread block
  - Needs algorithm work by the programmer (more later)

    Timeline: atomic operation N, then atomic operation N+1, each occupying only a very short slice of time compared to L2 or DRAM atomics


Privatization Technique for Improved Throughput
Module 7.5 - Parallel Computation Patterns (Histogram)

Objective
- To learn to write a high performance kernel by privatizing outputs
  - Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  - A high performance privatized histogram kernel
  - A practical example of using shared memory and L2 cache atomic operations

Privatization

(Figure: without privatization, Blocks 0 through N all perform atomic updates directly on the single final copy of the output - heavy contention and serialization. With privatization, each block performs its atomic updates on its own private copy (Copy 0, Copy 1, ..., Copy N), and the private copies are then merged into the final copy - much less contention and serialization.)

Cost and Benefit of Privatization
- Cost
  - Overhead for creating and initializing private copies
  - Overhead for accumulating the contents of private copies into the final copy
- Benefit
  - Much less contention and serialization in accessing both the private copies and the final copy
  - The overall performance can often be improved by more than 10x

Shared Memory Atomics for Histogram
- Each subset of threads is in the same block
- Much higher throughput than DRAM (100x) or L2 (10x) atomics
- Less contention - only threads in the same block can access a shared memory variable
- This is a very important use case for shared memory

Shared Memory Atomics Requires Privatization
- Create a private copy of the histo[] array for each thread block, and initialize its bin counters:

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        __shared__ unsigned int histo_private[7];

        // Initialize the bin counters in the private copy of histo[]
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

Build Private Histogram

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        // stride is the total number of threads
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            atomicAdd(&(histo_private[buffer[i] / 4]), 1);
            i += stride;
        }

Build Final Histogram

        // wait for all other threads in the block to finish
        __syncthreads();

        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }

A complete assembled version of this privatized kernel is sketched below.
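Putting the three fragments together, one plausible complete kernel looks like the following sketch (the kernel name is ours, and the letter-range test from Module 7.3 is reinstated here as an assumption so that only the characters 'a'..'z' are counted):

    __global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                            unsigned int *histo)
    {
        // One private 7-bin histogram per thread block, held in shared memory
        __shared__ unsigned int histo_private[7];
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Build the private histogram with low-latency shared memory atomics
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo_private[alphabet_position / 4]), 1);
            i += stride;
        }

        // Merge the private copy into the global histogram: one atomic per bin per block
        __syncthreads();
        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }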

More on Privatization
- Privatization is a powerful and frequently used technique for parallelizing applications
- The operation needs to be associative and commutative
  - The histogram add operation is associative and commutative
  - No privatization if the operation does not fit the requirement
- The private histogram size needs to be small
  - It must fit into shared memory
- What if the histogram is too large to privatize?
  - Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows)
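A hedged sketch of that last idea (entirely illustrative; the bin count, the PRIV threshold, and all names are assumptions): privatize only the low-numbered, presumably hottest bins in shared memory and fall back to global memory atomics for the rest.

    #define NUM_BINS 4096
    #define PRIV     1024   // how many bins fit comfortably in shared memory

    __global__ void histo_partial_private(unsigned int *bin_idx, long size,
                                          unsigned int *histo)
    {
        __shared__ unsigned int histo_s[PRIV];
        for (int b = threadIdx.x; b < PRIV; b += blockDim.x) histo_s[b] = 0;
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            unsigned int bin = bin_idx[i] % NUM_BINS;      // precomputed bin index per element
            if (bin < PRIV) atomicAdd(&histo_s[bin], 1);   // range test: privatized bins
            else            atomicAdd(&histo[bin], 1);     // the rest go straight to global memory
            i += stride;
        }

        __syncthreads();
        for (int b = threadIdx.x; b < PRIV; b += blockDim.x)
            atomicAdd(&histo[b], histo_s[b]);
    }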

Accelerated Computing
GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Page 14: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 15: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 16: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 17: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

Shared Memory Atomics Requires Privatization
- Create private copies of the histo[] array for each thread block

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        __shared__ unsigned int histo_private[7];

        // Initialize the bin counters in the private copy of histo[]
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

Build Private Histogram

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        // stride is total number of threads
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            atomicAdd(&(histo_private[buffer[i]/4]), 1);
            i += stride;
        }

Build Final Histogram

        // wait for all other threads in the block to finish
        __syncthreads();

        if (threadIdx.x < 7) {
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
        }
    }
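Putting the three fragments together, a complete privatized kernel might look like the sketch below. This is an assembled reading of the slides rather than code taken verbatim from them; in particular, the alphabet_position bounds check is borrowed from the basic kernel of Module 7.3 so that raw byte values map safely onto the 7 four-letter bins.

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        // One private 7-bin histogram per thread block
        __shared__ unsigned int histo_private[7];
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Interleaved (coalesced) traversal of the input
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo_private[alphabet_position / 4]), 1);  // shared memory atomic
            i += stride;
        }
        __syncthreads();

        // Merge the block's private histogram into the global one
        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);  // global memory atomic
    }

A host-side launch such as histo_kernel<<<numBlocks, 256>>>(d_buffer, size, d_histo) (grid and block sizes are illustrative) would produce the same 7-bin histogram as the basic kernel of Module 7.3, but with only 7 global memory atomics per block.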


More on Privatization
- Privatization is a powerful and frequently used technique for parallelizing applications
- The operation needs to be associative and commutative
  - The histogram add operation is associative and commutative
  - No privatization if the operation does not fit this requirement
- The private histogram needs to be small
  - It must fit into shared memory
- What if the histogram is too large to privatize?
  - Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
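A minimal sketch of that partial privatization is shown below. It is an illustration, not code from the kit: the bin_index input, the PRIV_BINS threshold, and the kernel name are all assumptions. Only the first PRIV_BINS bins are privatized in shared memory; range testing sends updates for those bins to the fast private copy and all other updates directly to the global histogram.

    #define PRIV_BINS 1024   // number of low-numbered bins kept in shared memory (assumed)

    __global__ void partial_priv_histo(const int *bin_index, long size,
                                       unsigned int *histo)  // histo[] has more than PRIV_BINS bins
    {
        __shared__ unsigned int histo_s[PRIV_BINS];

        // Initialize the privatized portion of the histogram
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            histo_s[b] = 0;
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int bin = bin_index[i];
            if (bin < PRIV_BINS)
                atomicAdd(&(histo_s[bin]), 1);   // range test passed: shared memory atomic
            else
                atomicAdd(&(histo[bin]), 1);     // non-privatized bin: global memory atomic
            i += stride;
        }
        __syncthreads();

        // Merge the privatized bins into the global histogram
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            if (histo_s[b] > 0)
                atomicAdd(&(histo[b]), histo_s[b]);
    }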


Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.





5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Diagram: without privatization, Block 0, Block 1, …, Block N all perform their atomic updates directly on a single Final Copy; with privatization, each block updates its own private Copy 0, Copy 1, …, Copy N, which are then merged into the Final Copy]

Heavy contention and serialization

4

Privatization (cont.)

[Diagram repeated, now highlighting the privatized arrangement: atomic updates go to the per-block private copies Copy 0, Copy 1, …, Copy N, which are then combined into the Final Copy]

Much less contention and serialization


6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  // stride is total number of threads
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position/4]), 1);
    i += stride;
  }

10

11

Build Final Histogram

  // wait for all other threads in the block to finish
  __syncthreads();

  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
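Putting slides 8 through 11 together, a complete privatized kernel could look like the sketch below. This is an assembled illustration rather than verbatim kit code: the kernel name histo_privatized_kernel is made up here, and the bounds check on alphabet_position is carried over from the Module 7.3 kernel so that non-letter bytes are skipped and the 7-entry shared array is never indexed out of range.

__global__ void histo_privatized_kernel(unsigned char *buffer,
                                        long size, unsigned int *histo) {
  // Per-block private copy of the 7-bin histogram
  __shared__ unsigned int histo_private[7];
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

  // Interleaved (coalesced) traversal of the input
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;       // total number of threads
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position/4]), 1);  // shared memory atomic
    i += stride;
  }

  // Merge step: one global atomic per bin per block
  __syncthreads();
  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

Compared with the Module 7.3 kernel, the inner-loop atomics now hit shared memory, and the only global atomics are the seven per block issued in the merge step.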

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)

12
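One way to read the partial-privatization idea is sketched below. Everything named here (PRIV_BINS, compute_bin, partial_priv_histo) is assumed for illustration: one bin per byte value, with only the low-numbered bins kept in shared memory.

#define PRIV_BINS 128   // how many low-numbered bins fit in shared memory (assumed)

// Placeholder binning: one bin per byte value, 256 bins total (assumed)
__device__ unsigned int compute_bin(unsigned char v) { return v; }

__global__ void partial_priv_histo(unsigned char *buffer, long size,
                                   unsigned int *histo) {  // histo[] holds all 256 bins
  __shared__ unsigned int histo_s[PRIV_BINS];
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    histo_s[b] = 0;
  __syncthreads();

  long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
  long stride = (long)blockDim.x * gridDim.x;
  while (i < size) {
    unsigned int bin = compute_bin(buffer[i]);
    if (bin < PRIV_BINS)
      atomicAdd(&histo_s[bin], 1u);   // range test: privatized bins go to shared memory
    else
      atomicAdd(&histo[bin], 1u);     // remaining bins go straight to global memory
    i += stride;
  }

  __syncthreads();
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    if (histo_s[b] > 0) atomicAdd(&histo[b], histo_s[b]);
}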

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 20: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 21: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

[Figure: without privatization, Blocks 0 … N all perform atomic updates directly on the single final copy – heavy contention and serialization. With privatization, each of Blocks 0 … N performs atomic updates on its own private Copy 0 … N, which is later merged into the final copy – much less contention and serialization.]

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing the private copies
  – Overhead for accumulating the contents of the private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

Shared Memory Atomics Require Privatization
– Create private copies of the histo[] array for each thread block

  __global__ void histo_kernel(unsigned char *buffer,
                               long size, unsigned int *histo)
  {
      // Initialize the bin counters in the private copies of histo[]
      __shared__ unsigned int histo_private[7];
      if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
      __syncthreads();

Build Private Histogram

      int i = threadIdx.x + blockIdx.x * blockDim.x;
      // stride is total number of threads
      int stride = blockDim.x * gridDim.x;
      while (i < size) {
          atomicAdd(&(histo_private[buffer[i]/4]), 1);
          i += stride;
      }

Build Final Histogram

      // wait for all other threads in the block to finish
      __syncthreads();

      if (threadIdx.x < 7) {
          atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
      }
  }
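Taken together, the three fragments above form one complete kernel. A minimal host-side sketch for launching it might look like the following; the grid and block sizes, the variable names, and the assumption that d_buffer already holds the input on the device are illustrative, not part of the original slides.

  // Host-side sketch (assumed names and sizes; error checking omitted).
  void run_histogram(unsigned char *d_buffer, long size)
  {
      unsigned int *d_histo;
      unsigned int h_histo[7];

      cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
      cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

      // Each thread strides over the whole input, so a modest grid suffices.
      int threads = 256;
      int blocks  = 120;   // illustrative choice
      histo_kernel<<<blocks, threads>>>(d_buffer, size, d_histo);

      cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int),
                 cudaMemcpyDeviceToHost);
      cudaFree(d_histo);
  }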

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch of this idea follows below)
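As an illustration of that last point only, here is a minimal sketch of partial privatization with range testing; PRIV_BINS, the kernel name, and all variable names are assumptions made for this example, not from the original slides.

  #define PRIV_BINS 256   // assumed: how many low-numbered bins fit in shared memory

  __global__ void partial_priv_histo(unsigned int *bins, unsigned int num_bins,
                                     const unsigned int *data, long size)
  {
      __shared__ unsigned int priv[PRIV_BINS];
      for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x) priv[b] = 0;
      __syncthreads();

      long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
      long stride = (long)blockDim.x * gridDim.x;
      for (; i < size; i += stride) {
          unsigned int b = data[i];
          if (b < PRIV_BINS)     atomicAdd(&priv[b], 1u);   // range test: private copy
          else if (b < num_bins) atomicAdd(&bins[b], 1u);   // too large: go to global memory
      }
      __syncthreads();

      // Merge the privatized low bins into the global histogram
      for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
          atomicAdd(&bins[b], priv[b]);
  }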


Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License



Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 22: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 23: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
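As a rough illustration of the last bullet, here is a hedged sketch of partial privatization with range testing. The constant PRIV_BINS, the kernel name, and the assumption that each element's bin index has already been computed into bins[] are inventions for this example, not part of the slides.

#define PRIV_BINS 64   // assumed number of "hot" bins kept in shared memory

__global__ void partial_histo_kernel(const unsigned int *bins, long size,
                                     unsigned int *histo, int num_bins)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int b = bins[i];               // precomputed bin index for element i
        if (b < PRIV_BINS)
            atomicAdd(&histo_s[b], 1);          // range test: privatized bins use shared memory
        else if (b < (unsigned int)num_bins)
            atomicAdd(&histo[b], 1);            // remaining bins go straight to global memory
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);       // merge only the privatized bins
}

Which bins to privatize is an application-level choice; the range test simply routes each update to the cheaper shared-memory atomic whenever a private copy exists.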

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Page 25: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data Races
Module 7.2 – Parallel Computation Patterns (Histogram)

2

Objective
– To understand data races in parallel computing
  – Data races can occur when performing read-modify-write operations
  – Data races can cause errors that are hard to reproduce
  – Atomic operations are designed to eliminate such data races

3

Read-Modify-Write in the Text Histogram Example
– For coalescing and better memory access performance

[figure: interleaved partitioning of the input text, with each thread performing a read-modify-write update on a bin counter]

4

Read-Modify-Write Used in Collaboration Patterns
– For example, multiple bank tellers count the total amount of cash in the safe
  – Each grabs a pile and counts it
  – There is a central display of the running total
  – Whenever someone finishes counting a pile, they read the current running total (read) and add the subtotal of the pile to the running total (modify-write)
– A bad outcome
  – Some of the piles were not accounted for in the final total

5

A Common Parallel Service Pattern
– For example, multiple customer service agents serving waiting customers
– The system maintains two numbers
  – the number to be given to the next incoming customer (I)
  – the number for the customer to be served next (S)
– The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)
– A central display shows the number for the customer to be served next
– When an agent becomes available, he/she calls the number (read S) and increments the display number by 1 (modify-write S)
– Bad outcomes
  – Multiple customers receive the same number; only one of them receives service
  – Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each customer
  – Brings up a flight seat map (read)
  – Decides on a seat
  – Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
  – Multiple passengers end up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables.

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in their Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:  Old ← Mem[x]   New ← Old + 1   Mem[x] ← New
thread2:  Old ← Mem[x]   New ← Old + 1   Mem[x] ← New
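The same race can be reproduced with a few lines of CUDA. The following kernel is an illustration of mine (not from the original slides): every thread increments one shared counter without an atomic, so the read-modify-write sequences of different threads can interleave exactly as in the timing scenarios that follow.

__global__ void racy_count(unsigned int *count)
{
    // Equivalent to: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New
    *count = *count + 1;
}

// Launching, e.g., racy_count<<<1, 256>>>(d_count) with *d_count initialized to 0
// usually leaves the counter far below 256; replacing the body with
// atomicAdd(count, 1) makes the result exactly 256.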

8

Timing Scenario 1
(the value produced by each step is shown in parentheses)

Time   Thread 1               Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                             (1) Old ← Mem[x]
5                             (2) New ← Old + 1
6                             (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time   Thread 1               Thread 2
1                             (0) Old ← Mem[x]
2                             (1) New ← Old + 1
3                             (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time   Thread 1               Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                             (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                             (1) New ← Old + 1
6                             (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time   Thread 1               Thread 2
1                             (0) Old ← Mem[x]
2                             (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                             (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

Either thread1 completes its entire read-modify-write before thread2 starts:

thread1:  Old ← Mem[x]   New ← Old + 1   Mem[x] ← New
thread2:                                               Old ← Mem[x]   New ← Old + 1   Mem[x] ← New

Or thread2 completes its entire read-modify-write before thread1 starts:

thread2:  Old ← Mem[x]   New ← Old + 1   Mem[x] ← New
thread1:                                               Old ← Mem[x]   New ← Old + 1   Mem[x] ← New

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations
– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0

time →
thread1:  Old ← Mem[x]        New ← Old + 1        Mem[x] ← New
thread2:        Old ← Mem[x]        New ← Old + 1        Mem[x] ← New

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location
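As an aside (an illustration of mine, not part of the deck), the same hardware guarantee lets more complex read-modify-write operations be built out of the compare-and-swap intrinsic listed on the next slide. A well-known sketch for an atomic floating-point max:

__device__ float atomicMaxFloat(float *addr, float val)
{
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int, assumed;
    do {
        assumed = old;
        // Retry until no other thread has modified the location in between
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(fmaxf(val, __int_as_float(assumed))));
    } while (assumed != old);
    return __int_as_float(old);
}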

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
  int atomicAdd(int *address, int val);
  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
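For instance (an illustrative use of mine, not from the slides), the returned old value gives each thread a unique slot, which is handy for compacting selected elements into an output array:

__global__ void append_even(const int *in, int n, int *out, int *out_count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && in[i] % 2 == 0) {
        // atomicAdd returns the previous value of *out_count,
        // which serves as this thread's private output index
        int slot = atomicAdd(out_count, 1);
        out[slot] = in[i];
    }
}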

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int *address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int *address, unsigned long long int val);
– Single-precision floating-point atomic add (capability > 2.0)
  float atomicAdd(float *address, float val);

6
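As a small illustration of the floating-point overload (my sketch, not from the slides; the accumulator is assumed to be zeroed before launch, on a device of compute capability 2.0 or above):

__global__ void sum_of_squares(const float *x, int n, float *sum)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        // Every thread folds its partial result into the single accumulator
        atomicAdd(sum, x[i] * x[i]);
    }
}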

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
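A minimal host-side sketch of how such a kernel might be launched. This is my illustration: the input string, the block size of 256, and the 7-bin output (alphabet_position / 4 over 26 letters gives bins 0–6) are assumptions rather than part of the slides.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// assumes histo_kernel from the slide above is defined in the same file
int main()
{
    const char *text = "Programming Massively Parallel Processors";
    long size = (long)strlen(text);

    unsigned char *d_buffer;
    unsigned int *d_histo;
    unsigned int h_histo[7] = {0};

    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, sizeof(h_histo));
    cudaMemcpy(d_buffer, text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, sizeof(h_histo));

    // 256 threads per block; enough blocks to cover the input once
    histo_kernel<<<(unsigned)((size + 255) / 256), 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, sizeof(h_histo), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 7; ++b)
        printf("bin %d: %u\n", b, h_histo[b]);

    cudaFree(d_buffer);
    cudaFree(d_histo);
    return 0;
}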

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

time →
[timeline: atomic operation N (DRAM read latency, then DRAM write latency) followed by atomic operation N+1 (DRAM read latency, then DRAM write latency)]

5

Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel

5
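To make the claim concrete, here is an illustrative calculation with assumed numbers (they are not from the slides): suppose the complete read-modify-write round trip to a contended DRAM location takes about 1000 cycles at a 1 GHz clock. Atomics on that single location then complete one at a time at roughly 1 µs each, i.e. about 10^6 updates per second. A single memory channel delivering, say, 16 GB/s could otherwise service on the order of 4 × 10^9 four-byte accesses per second, so contended atomics use only a tiny fraction of the available bandwidth.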

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they have started to check out
  – They run to the aisle to get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – A "free" improvement on global memory atomics

time →
[timeline: atomic operations N and N+1 back to back, each taking an L2 read latency plus an L2 write latency]

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by the programmer (more later)

time →
[timeline: atomic operations N and N+1 back to back, each completing in a very short time]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[figure: thread blocks 0…N making atomic updates directly to a single final copy, contrasted with each block updating its own private copy (Copy 0…Copy N) that is later merged into the final copy]

Heavy contention and serialization

4

Privatization (cont.)

[figure: the same diagram, with each block making its atomic updates to its own private copy]

Much less contention and serialization

5

Privatization (cont.)

[figure: the same diagram as on the previous slide]

Much less contention and serialization

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

9

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        // buffer[i] is assumed here to already hold an alphabet position (0–25)
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
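Putting the three code fragments above together gives one complete kernel. This assembly is mine: the range test on alphabet_position is borrowed from the Module 7.3 kernel in place of the simplified buffer[i] / 4 indexing on the slides, and at least 7 threads per block are assumed.

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the private bin counters (assumes blockDim.x >= 7)
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the block-private histogram with cheap shared-memory atomics
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}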

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory

12
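A minimal sketch of that last idea (my own illustration, not from the slides): only the first PRIV_BINS bins, assumed to be the heavily hit range, are privatized in shared memory, while the remaining bins are updated directly in global memory. The input is assumed to be precomputed bin indices.

#define PRIV_BINS 1024   // assumed split point; histo[] holds the full set of bins

__global__ void partial_priv_histo(const unsigned int *bin_idx, long size,
                                   unsigned int *histo)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = bin_idx[i];
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);   // hot range: cheap shared-memory atomic
        else
            atomicAdd(&histo[bin], 1);     // cold range: straight to global memory
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);  // merge the privatized range
}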

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 26: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 27: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0; the two read-modify-write sequences interleave over time:

thread1:  Old ← Mem[x]   New ← Old + 1   Mem[x] ← New
thread2:      Old ← Mem[x]   New ← Old + 1   Mem[x] ← New

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location address
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add

    int atomicAdd(int* address, int val);

  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
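Applied to a shared counter, a single call to the intrinsic above replaces the whole read-modify-write sequence. The small kernel below is an illustration (the name safe_increment is hypothetical, not from the slides):

    __global__ void safe_increment(int *counter) {
        // The read, add, and write happen as one indivisible hardware operation,
        // so no increments are lost regardless of thread timing.
        atomicAdd(counter, 1);
    }

Launching it with the same configuration as an unprotected version yields exactly one count per launched thread.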

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add

    unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add

    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

– Single-precision floating-point atomic add (compute capability ≥ 2.0)

    float atomicAdd(float* address, float val);
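When an operation is not on the list above, it can usually be synthesized from atomicCAS using the standard compare-and-swap retry idiom. The atomicMul below is a hypothetical example of that idiom, not from the slides:

    // Hypothetical atomic multiply built from the CAS-retry idiom.
    __device__ unsigned int atomicMul(unsigned int *address, unsigned int val)
    {
        unsigned int old = *address, assumed;
        do {
            assumed = old;
            // Install assumed*val only if the location still holds 'assumed';
            // otherwise retry with the value some other thread wrote.
            old = atomicCAS(address, assumed, assumed * val);
        } while (assumed != old);
        return old;   // like the built-in atomics, return the pre-update value
    }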

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
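A minimal host-side sketch for launching this kernel (not part of the slide; the 256-bin histogram matches one bin per byte value, and the launch configuration is illustrative):

    void run_histogram(const unsigned char *h_buffer, long size)
    {
        unsigned char *d_buffer;
        unsigned int  *d_histo;
        unsigned int   h_histo[256];                    // one bin per byte value

        cudaMalloc(&d_buffer, size);
        cudaMalloc(&d_histo, 256 * sizeof(unsigned int));
        cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
        cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));

        // Enough blocks to keep the GPU busy; the strided loop covers any size
        histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

        cudaMemcpy(h_histo, d_histo, 256 * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);            // h_histo now holds the result
        cudaFree(d_buffer);
        cudaFree(d_histo);
    }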

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        // Four letters per bin ('a'-'d', 'e'-'h', ...), hence the divide by 4
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position/4]), 1);
        i += stride;
    }
}

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time no one else can access the location

4

Atomic Operations on DRAM
– Each Read-Modify-Write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

Timeline: atomic operation N (DRAM read latency, then DRAM write latency) followed by atomic operation N+1 (DRAM read latency, then DRAM write latency), back to back in time.

5

Latency Determines Throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
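A back-of-the-envelope check with illustrative numbers (the 1 GHz memory clock is an assumption, not from the slides): with a 1000-cycle read-modify-write latency, one contended location sustains at most

\[
\frac{10^{9}\ \text{cycles/s}}{1000\ \text{cycles per atomic RMW}} = 10^{6}\ \text{atomic operations/s},
\]

which is roughly 4 MB/s of 4-byte updates, orders of magnitude below the peak bandwidth of a single memory channel.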

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
– The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
– The rate of checkout would be 1 / (entire shopping time of each customer)

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

Timeline: atomic operation N (L2 read latency, then L2 write latency) followed by atomic operation N+1 (L2 read latency, then L2 write latency).

8

Hardware Improvements
– Atomic operations on Shared Memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

Timeline: atomic operation N followed by atomic operation N+1, each with very short latency.

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

3

Privatization

[Figure: without privatization, Blocks 0..N all perform atomic updates directly on the single Final Copy, causing heavy contention and serialization.]

4

Privatization (cont.)

[Figure: with privatization, each of Blocks 0..N performs its atomic updates on its own private copy (Copy 0..Copy N), giving much less contention and serialization.]

5

Privatization (cont.)

[Figure: the private copies Copy 0..Copy N are then combined into the Final Copy, again with much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        // Same binning as in the basic kernel, but the atomic update goes to
        // the block-private copy in shared memory instead of global memory
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position/4]), 1);
        i += stride;
    }

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    // accumulate this block's private counts into the global histogram
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
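Putting the fragments from slides 8-11 together, here is a minimal sketch of the complete privatized kernel; the assembled form is an illustration, with naming following the histo_private declaration above:

    __global__ void histo_privatized_kernel(unsigned char *buffer,
                                            long size, unsigned int *histo)
    {
        // Block-private histogram: 7 four-letter bins in shared memory
        __shared__ unsigned int histo_private[7];
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Interleaved (strided) traversal of the input
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo_private[alphabet_position/4]), 1);
            i += stride;
        }

        // Merge this block's private counts into the global histogram
        __syncthreads();
        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }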

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
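A minimal sketch of that partial-privatization idea (hypothetical; the bin split, names, and the use of precomputed bin indices as input are illustrative): the first PRIV_BINS bins live in shared memory, and updates to all other bins go directly to the global histogram.

    #define NUM_BINS  1024
    #define PRIV_BINS 256

    __global__ void histo_partial_priv(unsigned int *bin_idx, long size,
                                       unsigned int *histo)
    {
        // Privatize only the first PRIV_BINS bins in shared memory
        __shared__ unsigned int histo_s[PRIV_BINS];
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            histo_s[b] = 0;
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            unsigned int bin = bin_idx[i];
            if (bin < PRIV_BINS)
                atomicAdd(&histo_s[bin], 1);    // range test: privatized bin
            else if (bin < NUM_BINS)
                atomicAdd(&histo[bin], 1);      // non-privatized bin, global atomic
            i += stride;
        }

        // Merge the privatized bins into the global histogram
        __syncthreads();
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            atomicAdd(&histo[b], histo_s[b]);
    }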

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

[Chart data for the Module 7.1 text histogram example, bins and counts: a-d: 5, e-h: 5, i-l: 6, m-p: 10, q-t: 10, u-x: 1, y-z: 1]
Page 28: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 29: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5
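To put rough numbers on this claim (an illustrative back-of-the-envelope estimate; the clock rate and bandwidth figures are assumptions, not measurements of any particular GPU): with a memory system running at about 1 GHz and roughly 1,000 cycles per atomic read-modify-write on one location, that location can sustain only about 10^9 / 1,000 = 10^6 atomic updates per second, i.e. a few MB/s of useful 4-byte counter updates, while a single DRAM channel may deliver tens of GB/s of peak bandwidth.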

6

– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
– The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
– The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
– Medium latency, about 1/10 of the DRAM latency
– Shared among all blocks
– "Free improvement" on global memory atomics

[Timeline: atomic operation N is followed by atomic operation N+1; each incurs an L2 read latency and then an L2 write latency]

8

Hardware Improvements
– Atomic operations on shared memory
– Very short latency
– Private to each thread block
– Need algorithm work by programmers (more later)

[Timeline: atomic operation N is followed by atomic operation N+1; each completes after only a short shared-memory latency]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
– Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
– A high-performance privatized histogram kernel
– Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Figure: without privatization, Blocks 0 through N all perform atomic updates directly on a single Final Copy (heavy contention and serialization); with privatization, each block performs atomic updates on its own private copy (Copy 0 through Copy N), which are later merged into the Final Copy]

4

Privatization (cont)

[Figure: same diagrams as the previous slide, now emphasizing the privatized path, in which each block updates its own private copy (much less contention and serialization)]

5

Privatization (cont)

[Figure: repeated from the previous slide (much less contention and serialization with per-block private copies)]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

9

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
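Assembling the fragments from the last few slides, the complete privatized kernel might look like the sketch below. This is a reconstruction for readability, not code shipped with the kit; it keeps the 7-bin layout and, as in the Module 7.3 kernel, maps bytes to bins via their alphabet position, a detail the slide fragments elide.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    // One private 7-bin histogram per thread block, held in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with (cheap) shared-memory atomics
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;    // total number of threads
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge into the global histogram
    // with one (expensive) global atomic per bin per block
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}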

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows below)

12
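A minimal sketch of that partial-privatization idea, under assumptions made purely for illustration (the NUM_BINS and PRIV_BINS sizes, the modulo bin mapping, and the choice to privatize the low-numbered bins are not from the slides): bins below the threshold are counted in shared memory, the rest fall through to global-memory atomics.

#define NUM_BINS  4096     // assumed size of the full histogram (too large to privatize)
#define PRIV_BINS 1024     // assumed number of bins kept in shared memory per block

__global__ void histo_partial_private(unsigned int *input, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = input[i] % NUM_BINS;
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);   // range test: privatized bins go to shared memory
        else
            atomicAdd(&histo[bin], 1);     // remaining bins go straight to global memory
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);  // merge the privatized bins into the global histogram
}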

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.


they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 33: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 34: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 35: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 36: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline figure: atomic operation N (DRAM read latency, then DRAM write latency) is followed end to end by atomic operation N+1 (DRAM read latency, then DRAM write latency) along the time axis]

5

Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
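A rough worked example (not on the original slide), assuming a 1 GHz clock and a read-modify-write latency of about 1000 cycles: atomics serialized on a single location complete at roughly 10^9 / 1000 = 10^6 operations per second, i.e. about 4 MB/s of 4-byte counter updates, compared with the many GB/s a DRAM channel can deliver for streaming, coalesced accesses.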

5

6

You may have a similar experience at supermarket checkout
– Some customers realize that they missed an item after they have started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline figure: atomic operation N followed by atomic operation N+1, each taking an L2 read latency plus an L2 write latency, both much shorter than the DRAM latencies in the previous figure]

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

[Timeline figure: atomic operation N followed by atomic operation N+1, with much shorter per-operation latency than DRAM or L2]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Figure: Block 0, Block 1, …, Block N all performing atomic updates directly on one Final Copy – heavy contention and serialization – contrasted with each block updating its own private Copy 0, Copy 1, …, Copy N that is later merged into the Final Copy]

4

Privatization (cont.)

[Figure: each block performs its atomic updates on its own private copy (Copy 0, Copy 1, …, Copy N); the private copies are then combined into the Final Copy – much less contention and serialization]


6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Require Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Require Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
  __shared__ unsigned int histo_private[7];

  if (threadIdx.x < 7)
    histo_private[threadIdx.x] = 0;
  __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  // stride is the total number of threads
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    atomicAdd(&(histo_private[buffer[i] / 4]), 1);
    i += stride;
  }

10

11

Build Final Histogram

  // wait for all other threads in the block to finish
  __syncthreads();

  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
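Assembled from the fragments above (the slides do not show it as a single listing), a sketch of the complete privatized histogram kernel. The alphabet_position bounds check from the earlier basic kernel is re-added so only the letters a–z are counted, the kernel assumes at least 7 threads per block, and the name histo_privatized_kernel is illustrative.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
  // Per-block private histogram in shared memory (7 bins of 4 letters each)
  __shared__ unsigned int histo_private[7];
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

  // Interleaved (strided) traversal of the input for coalesced accesses
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position / 4]), 1);  // shared memory atomic
    i += stride;
  }

  // Merge this block's private copy into the global histogram
  __syncthreads();
  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);  // global memory atomic
}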

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory, as sketched below

12
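A sketch (not from the slides) of the range-testing idea: bins below some cutoff, assumed to be the most frequently hit, live in a privatized shared memory copy, while the remaining bins are updated directly in global memory. NUM_BINS, PRIV_BINS, the input layout, and the kernel name histo_partial_private are all illustrative assumptions.

#define NUM_BINS  4096   // total bins, too many to privatize fully (illustrative value)
#define PRIV_BINS 512    // low-numbered "hot" bins kept in shared memory (illustrative value)

__global__ void histo_partial_private(unsigned int *input, long size, unsigned int *histo)
{
  __shared__ unsigned int histo_s[PRIV_BINS];
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x) histo_s[b] = 0;
  __syncthreads();

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    unsigned int bin = input[i] % NUM_BINS;
    if (bin < PRIV_BINS)
      atomicAdd(&histo_s[bin], 1);      // range test: hot bins use shared memory atomics
    else
      atomicAdd(&histo[bin], 1);        // remaining bins go straight to global memory
    i += stride;
  }

  __syncthreads();
  for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    atomicAdd(&histo[b], histo_s[b]);   // merge the privatized part into the final copy
}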

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 37: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 38: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 39: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 40: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 41: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Diagram: in the baseline scheme, Block 0, Block 1, …, Block N all perform atomic updates directly on a single final copy of the histogram – heavy contention and serialization. In the privatized scheme, each block updates its own private copy (Copy 0, Copy 1, …, Copy N), and the private copies are then merged into the final copy.]

4

Privatization (cont.)

[Diagram, continued: with per-block private copies, the atomic updates from Block 0 … Block N go to Copy 0 … Copy N, and only the final merge touches the final copy – much less contention and serialization.]

5

Privatization (cont.)

[Same diagram as above, emphasizing the privatized side – much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
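Putting the fragments above together, a complete privatized kernel might look like the sketch below. The 7-bin private array, the interleaved input partitioning, and the final merge follow the slides; the function name histo_privatized_kernel is an assumption, and the alphabet_position guard is reused from the basic kernel in Module 7.3 so the 7-bin private array is never indexed out of range (the slides abbreviate this step as buffer[i]/4). The block is assumed to have at least 7 threads.

// Sketch: privatized text histogram kernel assembled from the slide fragments.
// Assumptions: blockDim.x >= 7; only lowercase letters 'a'..'z' are counted,
// four letters per bin (7 bins in total).
__global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                        unsigned int *histo)
{
    // One private copy of histo[] per thread block, held in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved (coalesced) partitioning
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position/4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge the private copy into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

Only seven global atomicAdd operations per block ever touch histo[], so contention on the final copy grows with the number of blocks rather than with the number of input characters.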

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory, as sketched below
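One possible reading of that last bullet is sketched below. The split point PRIVATE_BINS, the parameter num_bins, the kernel name, and the choice to privatize the low-numbered bins are all assumptions for illustration, not part of the slides.

// Sketch of partial privatization with range testing (illustrative only).
#define PRIVATE_BINS 64   // assumed split point: bins below this get a shared-memory copy

__global__ void histo_partial_private(unsigned char *buffer, long size,
                                      unsigned int *histo, int num_bins)
{
    // Private copy only for the low (assumed hot) range of bins
    __shared__ unsigned int histo_private[PRIVATE_BINS];
    for (int b = threadIdx.x; b < PRIVATE_BINS; b += blockDim.x)
        histo_private[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int bin = buffer[i];                      // assumed: the byte value is the bin index
        if (bin < PRIVATE_BINS)
            atomicAdd(&histo_private[bin], 1);    // privatized range: shared-memory atomic
        else if (bin < num_bins)
            atomicAdd(&histo[bin], 1);            // remaining range: global-memory atomic
        i += stride;
    }

    // Merge only the privatized range back into the global histogram
    __syncthreads();
    for (int b = threadIdx.x; b < PRIVATE_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_private[b]);
}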

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

[Chart data (text histogram of "Programming Massively Parallel Processors"): a-d: 5, e-h: 5, i-l: 6, m-p: 10, q-t: 10, u-x: 1, y-z: 1]
Page 42: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 43: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 44: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
    int atomicAdd(int* address, int val);
  – Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
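As a small usage sketch (my own, not from the slides), the returned old value is handy for handing out unique slots, for example when threads append results to a shared output array:

// Hypothetical kernel: each thread claims a unique slot in `out` by atomically
// bumping a counter; the returned old value is that thread's private index.
__global__ void append_results(float *out, int *count, float value) {
    int slot = atomicAdd(count, 1);  // old value = this thread's slot
    out[slot] = value;               // no two threads write the same slot
}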

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
    unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (capability > 2.0)
    float atomicAdd(float* address, float val);
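A minimal sketch of the floating-point overload (my own example; it assumes compute capability 2.0 or higher and that *sum starts at 0): every thread folds one array element into a single global accumulator.

__global__ void sum_kernel(const float *in, int n, float *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, in[i]);  // all additions to *sum are serialized by hardware
}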

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
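For context, a host-side launch of this kernel might look like the following sketch (my own; the helper name run_histogram, the 7-bin output, and the 32-block by 256-thread configuration are assumptions, not from the slides):

// Hypothetical host code for launching histo_kernel on a text buffer.
#include <cuda_runtime.h>

void run_histogram(const char *h_text, long size, unsigned int h_histo[7])
{
    const int NUM_BINS = 7;                 // 26 letters grouped into 4-letter bins
    unsigned char *d_buffer;
    unsigned int  *d_histo;

    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    // Enough blocks to keep the GPU busy; the strided loop covers any input size
    histo_kernel<<<32, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
}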

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no other thread can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

(Timeline figure: atomic operation N occupies a DRAM read latency followed by a DRAM write latency, and only then can atomic operation N+1 begin its own DRAM read latency and DRAM write latency.)

5

Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1,000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel
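To put rough numbers on this (an illustrative calculation, not from the slides): assuming a combined read-modify-write latency of about 1,000 cycles and a 1 GHz memory clock, a single contended location sustains on the order of 10^9 / 10^3 = 10^6 atomic operations per second, i.e. about 4 MB/s of 4-byte updates, which is several orders of magnitude below the tens of GB/s a memory channel delivers for coalesced, conflict-free accesses.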

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

(Timeline figure: atomic operation N takes two L2 latencies, and only then can atomic operation N+1 begin its own two L2 latencies.)

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Need algorithm work by programmers (more later)

(Timeline figure: atomic operation N is followed almost immediately by atomic operation N+1 on the time axis.)

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

3

Privatization

(Figure: without privatization, Block 0, Block 1, …, Block N all perform atomic updates directly on a single Final Copy, causing heavy contention and serialization. With privatization, each block performs atomic updates on its own private copy (Copy 0, Copy 1, …, Copy N), and the private copies are then merged into the Final Copy, giving much less contention and serialization.)

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(private_histo[buffer[i] / 4]), 1);
        i += stride;
    }

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), private_histo[threadIdx.x]);
    }
}
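Assembled into one piece, the privatized kernel reads as follows (a sketch combining the fragments above; the slides use both histo_private and private_histo for the same array, unified here as histo_private, and the bin index follows the Module 7.3 version, buffer[i] - 'a' with a range check, rather than the raw buffer[i] / 4 shown on the previous slide, since a 7-bin private array needs the bounds test):

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    // One private 7-bin histogram per thread block, in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (coalesced) traversal of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Merge this block's private histogram into the global result
    __syncthreads();
    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}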

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory, as in the sketch below
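One way such partial privatization might look (my own sketch, not from the slides): only the first PRIV_BINS bins, assumed here to be the most frequently hit range, get a shared-memory private copy, while updates to the remaining bins fall through to global-memory atomics. PRIV_BINS, partial_priv_histo, and the choice of hot range are illustrative assumptions.

#define PRIV_BINS 1024   // hypothetical number of "hot" bins privatized per block

__global__ void partial_priv_histo(const unsigned int *data, long size,
                                   unsigned int *histo, int num_bins)
{
    // Privatize only the first PRIV_BINS bins; the rest stay in global memory
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int bin = data[i];
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);   // range test: shared-memory atomic (fast path)
        else if (bin < num_bins)
            atomicAdd(&histo[bin], 1);     // range test: global-memory atomic (fallback)
        i += stride;
    }

    // Accumulate the privatized bins into the global histogram
    __syncthreads();
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);
}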

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 45: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 46: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 47: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each customer
  – Brings up a flight seat map (read)
  – Decides on a seat
  – Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
  – Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:  Old ← Mem[x]          thread2:  Old ← Mem[x]
          New ← Old + 1                   New ← Old + 1
          Mem[x] ← New                    Mem[x] ← New
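In CUDA terms, the three-step sequence above is exactly what a plain, non-atomic increment turns into. A minimal kernel that exhibits the race might look like the following sketch (the kernel name and launch shape are ours, for illustration only):

// Every thread increments the same counter WITHOUT an atomic operation.
// The increment is really a read-modify-write, so concurrent threads can read
// the same Old value and overwrite each other's New value; the final count is
// timing-dependent and usually far smaller than the number of threads.
__global__ void racy_increment(unsigned int *x)
{
    unsigned int Old = *x;        // read   (Old ← Mem[x])
    unsigned int New = Old + 1;   // modify (New ← Old + 1)
    *x = New;                     // write  (Mem[x] ← New)
}

// Example launch: racy_increment<<<256, 256>>>(d_x);
// With 65,536 threads the value left in *d_x can be almost anything from 1 upward.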

8

Timing Scenario 1

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                            (1) Old ← Mem[x]
5                            (2) New ← Old + 1
6                            (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3                            (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                            (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                            (1) New ← Old + 1
6                            (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                            (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

An atomic operation forces one thread's entire read-modify-write sequence to complete before the other's begins, so only these two outcomes are possible:

thread1:  Old ← Mem[x]
          New ← Old + 1
          Mem[x] ← New
                            thread2:  Old ← Mem[x]
                                      New ← Old + 1
                                      Mem[x] ← New

Or

thread2:  Old ← Mem[x]
          New ← Old + 1
          Mem[x] ← New
                            thread1:  Old ← Mem[x]
                                      New ← Old + 1
                                      Mem[x] ← New

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0

thread1:  Old ← Mem[x]    New ← Old + 1    Mem[x] ← New
thread2:       Old ← Mem[x]    New ← Old + 1    Mem[x] ← New

time ────────────────▶

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add

  int atomicAdd(int* address, int val);

  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old. (A minimal usage sketch follows.)
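As a minimal usage sketch of the signature above (the kernel and variable names are ours, not part of the original slides), each thread atomically adds 1 to a shared total and receives the pre-update value back:

// Count the nonzero elements of data[0..n-1] into *total.
__global__ void count_nonzero(const int *data, int n, int *total)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && data[i] != 0) {
        int old = atomicAdd(total, 1);   // returns the value *total held before this add
        (void)old;                       // often ignored; useful when a unique slot index is needed
    }
}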

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add

  unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add

  unsigned long long int atomicAdd(unsigned long long int* address,
                                   unsigned long long int val);

– Single-precision floating-point atomic add (compute capability 2.0 and higher)

  float atomicAdd(float* address, float val);

6

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
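For completeness, a host-side driver for this kernel might look like the sketch below; the block and grid dimensions, the test string, and the error-handling-free style are illustrative choices rather than part of the original slides.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main()
{
    const char *text = "programming massively parallel processors";  // lowercase test input
    long size = (long)strlen(text);
    unsigned int h_histo[7] = {0};

    unsigned char *d_buffer;
    unsigned int  *d_histo;
    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

    int threads = 256;
    int blocks  = (int)((size + threads - 1) / threads);  // the strided loop tolerates any grid size
    histo_kernel<<<blocks, threads>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 7; b++) printf("bin %d: %u\n", b, h_histo[b]);

    cudaFree(d_buffer);
    cudaFree(d_histo);
    return 0;
}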

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency + DRAM write latency) is followed by atomic operation N+1 (DRAM read latency + DRAM write latency)]

time ────────────────▶

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel (a rough worked example follows)
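As a rough worked example (the numbers are assumptions for illustration, not measurements): with a 1 GHz clock and a read-modify-write latency of about 1000 cycles, a single DRAM location can complete at most 10^9 / 1000 = 10^6 atomic operations per second, i.e. roughly 4 MB/s of useful 4-byte updates, while even a single memory channel offers on the order of 10 GB/s of peak bandwidth. The achieved throughput under heavy contention on one location is therefore well below 1/1000 of the peak, as stated above.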

5

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they have started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline: atomic operation N (L2 read latency + L2 write latency) is followed by atomic operation N+1 (L2 read latency + L2 write latency)]

time ────────────────▶

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Need algorithm work by programmers (more later)

[Timeline: atomic operation N is followed almost immediately by atomic operation N+1]

time ────────────────▶

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Figure: Block 0, Block 1, …, Block N all perform atomic updates directly on a single Final Copy; the privatized alternative gives each block its own copy (Copy 0, Copy 1, …, Copy N), which is later merged into the Final Copy]

Heavy contention and serialization

4

Privatization (cont)

[Figure (privatized case): each of Block 0 … Block N performs atomic updates on its own private copy (Copy 0 … Copy N), and the copies are then merged into the Final Copy]

Much less contention and serialization

5

Privatization (cont)

[Figure (repeated from the previous slide): each of Block 0 … Block N performs atomic updates on its own private copy (Copy 0 … Copy N), and the copies are then merged into the Final Copy]

Much less contention and serialization

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Require Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Require Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        // assumes buffer[i] already holds an alphabet position (0-25)
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
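Putting the three fragments together, the complete privatized kernel looks roughly as follows. The assembly is ours: we keep the 7-bin layout and, as in the Module 7.3 kernel, map raw text to alphabet positions with buffer[i] - 'a' before dividing by 4 (the fragment above indexes with buffer[i]/4, which presumes the input has already been converted to alphabet positions). The kernel assumes blockDim.x >= 7.

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory (7 bins of 4 letters each)
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (strided) traversal of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);  // shared-memory atomic
        i += stride;
    }

    // Wait for the whole block, then merge the private copy into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);  // global/L2 atomic
}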

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
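The slides leave partial privatization as an idea; one possible shape for such a kernel is sketched below. Everything here is an assumption for illustration: a histogram of num_bins byte-valued bins in which only the first HOT_BINS bins are expected to be heavily contended, so only those are privatized in shared memory while the rest are updated directly in global memory.

#define HOT_BINS 64   // assumed cutoff: bins 0..63 are privatized, the rest are not

__global__ void histo_partially_private(unsigned char *buffer, long size,
                                        unsigned int *histo, int num_bins)
{
    __shared__ unsigned int hot[HOT_BINS];
    for (int b = threadIdx.x; b < HOT_BINS; b += blockDim.x) hot[b] = 0;
    __syncthreads();

    int stride = blockDim.x * gridDim.x;
    for (long i = threadIdx.x + blockIdx.x * blockDim.x; i < size; i += stride) {
        int bin = buffer[i];                 // bin index; here simply the byte value
        if (bin < HOT_BINS)
            atomicAdd(&hot[bin], 1);         // range test: hot bins use shared-memory atomics
        else if (bin < num_bins)
            atomicAdd(&histo[bin], 1);       // cold bins go straight to global memory
    }

    __syncthreads();
    for (int b = threadIdx.x; b < HOT_BINS; b += blockDim.x)
        if (hot[b] > 0) atomicAdd(&histo[b], hot[b]);   // merge the privatized bins
}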

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 48: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 49: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 50: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

– For example, multiple bank tellers count the total amount of cash in the safe
  – Each grabs a pile and counts
  – Have a central display of the running total
  – Whenever someone finishes counting a pile, they read the current running total (read) and add the subtotal of the pile to the running total (modify-write)
– A bad outcome
  – Some of the piles were not accounted for in the final total

5

A Common Parallel Service Pattern
– For example, multiple customer service agents serving waiting customers
– The system maintains two numbers
  – the number to be given to the next incoming customer (I)
  – the number for the customer to be served next (S)
– The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)
– A central display shows the number for the customer to be served next
– When an agent becomes available, he/she calls the number (read S) and increments the display number by 1 (modify-write S)
– Bad outcomes
  – Multiple customers receive the same number; only one of them receives service
  – Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each customer
  – Brings up a flight seat map (read)
  – Decides on a seat
  – Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
  – Multiple passengers end up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race (a minimal code illustration follows).

thread1:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread2:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
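
The same race can be written down directly in CUDA. The following is a minimal sketch (an assumed illustration, not code from the slides): every thread performs the unprotected read-modify-write on one global counter, so updates overwrite each other and the final value depends on timing. Launching it on a zero-initialized counter with, say, one block of 256 threads typically yields a value far smaller than 256.

// Assumed illustration of the race above: each thread does an
// unprotected read-modify-write on the same global counter.
__global__ void racy_increment(int *x)
{
    int old_val = *x;           // read
    int new_val = old_val + 1;  // modify
    *x = new_val;               // write: may silently overwrite another thread's update
}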

8

Timing Scenario 1

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                            (1) Old ← Mem[x]
5                            (2) New ← Old + 1
6                            (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3                            (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time   Thread 1              Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                            (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                            (1) New ← Old + 1
6                            (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time   Thread 1              Thread 2
1                            (0) Old ← Mem[x]
2                            (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                            (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

An atomic operation forces the two read-modify-write sequences to execute one after the other, in either order (see the atomicAdd sketch below):

thread1:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread2:                                               Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Or

thread2:  Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread1:                                               Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
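
For contrast, a minimal sketch (assumed, not from the slides) of the same counter update written with CUDA's atomicAdd; the hardware performs the read, the add, and the write as one indivisible operation, so one thread's sequence completes before the other's begins, matching the two good orderings above.

// The same update as an atomic read-modify-write; atomicAdd returns the
// old value, which plays the role of "Old" in the slides.
__global__ void atomic_increment(int *x)
{
    int old_val = atomicAdd(x, 1);
    (void)old_val;  // "Old" is available here if the caller needs it
}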

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0; time runs left to right:

thread1:  Old ← Mem[x]      New ← Old + 1      Mem[x] ← New
thread2:       Old ← Mem[x]      New ← Old + 1      Mem[x] ← New

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap); a CAS-based sketch follows
  – Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
  int atomicAdd(int* address, int val);
  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address; the function returns old
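
The compare-and-swap (CAS) intrinsic listed above can itself be used to build other read-modify-write operations. As an illustrative sketch of that idea, the well-known loop from the CUDA C Programming Guide implements a double-precision atomic add on top of atomicCAS; the loop retries whenever another thread changed the location between the read and the swap. The name atomicAddDouble is used here only to avoid clashing with the built-in double overload on newer GPUs.

// Sketch: composing a read-modify-write from atomicCAS (pattern from the
// CUDA C Programming Guide). Retries until no other thread intervened.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // another thread updated the location; try again
    return __longlong_as_double(old);
}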

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 and above)
  float atomicAdd(float* address, float val);

6
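
As a small usage sketch (an assumed example, not from the slides), the single-precision overload lets every thread fold its element into one running total:

// Each thread adds its element into a single float total.
// Requires compute capability 2.0 or higher for float atomicAdd.
__global__ void sum_kernel(const float *data, int n, float *total)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(total, data[i]);
}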

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern (a host-side launch sketch follows)

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
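
To show how such a kernel is typically driven from the host, here is a hedged sketch (not part of the kit). It assumes the histo_kernel above is defined in the same file; the buffer names, the 256-thread block size, and the lowercase test string are illustrative choices, and error checking is omitted for brevity. The seven bins correspond to a-d, e-h, i-l, m-p, q-t, u-x, and y-z.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Lowercase input: the kernel above only counts characters 'a'..'z'
    const char *text = "programming massively parallel processors";
    long size = (long)strlen(text);
    unsigned int h_histo[7] = {0};

    unsigned char *d_buffer;
    unsigned int  *d_histo;
    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

    // One thread per input element is enough here; the strided loop
    // inside the kernel also handles inputs larger than the grid.
    histo_kernel<<<(int)((size + 255) / 256), 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 7; ++b)
        printf("bin %d: %u\n", b, h_histo[b]);

    cudaFree(d_buffer);
    cudaFree(d_histo);
    return 0;
}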

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

(Timeline: atomic operation N = DRAM read latency + DRAM write latency, followed immediately by atomic operation N+1 = DRAM read latency + DRAM write latency)

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel: only one atomic completes per full read-modify-write latency, while the channel could otherwise service roughly one access per cycle (a microbenchmark sketch follows)
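
A small sketch (assumed, not from the slides) that makes this easy to observe: timing the two kernels below, for instance with cudaEvent timers, typically shows the fully contended version running far slower, because every read-modify-write on counter[0] must wait its turn.

// All threads hammer the SAME location: the atomics fully serialize.
__global__ void contended_atomics(unsigned int *counter, int iters)
{
    for (int k = 0; k < iters; ++k)
        atomicAdd(&counter[0], 1);
}

// Each block uses its own slot: far fewer threads collide per location.
__global__ void spread_atomics(unsigned int *counters, int iters)
{
    for (int k = 0; k < iters; ++k)
        atomicAdd(&counters[blockIdx.x], 1);
}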

5

6

You may have had a similar experience at a supermarket checkout
– Some customers realize that they missed an item after they have started to check out
  – They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

(Timeline: atomic operation N = L2 read latency + L2 write latency, followed by atomic operation N+1 = L2 read latency + L2 write latency)

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

(Timeline: atomic operation N followed by atomic operation N+1, with much shorter latencies than the DRAM or L2 cases)

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

(Figure: without privatization, Block 0, Block 1, …, Block N all perform their atomic updates directly on the single final copy – heavy contention and serialization.)

4

Privatization (cont.)

(Figure: with privatization, each block performs its atomic updates on its own private copy – Copy 0, Copy 1, …, Copy N – much less contention and serialization.)

5

Privatization (cont.)

(Figure: the private copies are then combined into the final copy – much less contention and serialization.)

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each private copy is updated by a subset of threads that all belong to the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];
    ...

8

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();
    ...

9

10

Build Private Histogram

    // (the slides call this array private_histo; histo_private is used here
    //  for consistency with the declaration above)
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}

(a fully assembled version of this kernel is sketched below)

11
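
Putting the fragments from the last few slides together, a complete privatized text histogram kernel might look like the sketch below. The assembly, and re-adding the alphabet bounds check from Module 7.3, are editorial assumptions rather than a slide from the kit; it also assumes blockDim.x >= 7 so the initialization and merge steps cover every bin.

__global__ void histo_privatized_kernel(unsigned char *buffer,
                                        long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory (7 four-letter bins)
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (strided) processing of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}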

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 51: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 52: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 53: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each
– Brings up a flight seat map (read)
– Decides on a seat
– Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
– Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread2: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
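To make the race concrete, here is a minimal CUDA sketch (not part of the slides) in which every thread performs the unprotected read-modify-write shown above on the same counter; the final value is typically far smaller than the number of threads and changes from run to run.

// Illustration only: each thread does Old = *x; New = Old + 1; *x = New
// with no protection, so updates can overwrite each other.
__global__ void racy_increment(unsigned int *x) {
    unsigned int old_val = *x;            // read
    unsigned int new_val = old_val + 1;   // modify
    *x = new_val;                         // write (may clobber another thread's update)
}

// Launched as, e.g., racy_increment<<<1024, 256>>>(d_counter), the result in
// *d_counter is usually much less than 1024 * 256, and it varies between runs.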

8

Timing Scenario 1

Time    Thread 1              Thread 2
1       (0) Old ← Mem[x]
2       (1) New ← Old + 1
3       (1) Mem[x] ← New
4                             (1) Old ← Mem[x]
5                             (2) New ← Old + 1
6                             (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time    Thread 1              Thread 2
1                             (0) Old ← Mem[x]
2                             (1) New ← Old + 1
3                             (1) Mem[x] ← New
4       (1) Old ← Mem[x]
5       (2) New ← Old + 1
6       (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time    Thread 1              Thread 2
1       (0) Old ← Mem[x]
2       (1) New ← Old + 1
3                             (0) Old ← Mem[x]
4       (1) Mem[x] ← New
5                             (1) New ← Old + 1
6                             (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time    Thread 1              Thread 2
1                             (0) Old ← Mem[x]
2                             (1) New ← Old + 1
3       (0) Old ← Mem[x]
4                             (1) Mem[x] ← New
5       (1) New ← Old + 1
6       (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

Either
thread1: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
followed by
thread2: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Or
thread2: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
followed by
thread1: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
– Atomic operation concepts
– Types of atomic operations in CUDA
– Intrinsic functions
– A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0

thread1: Old ← Mem[x]    New ← Old + 1    Mem[x] ← New
thread2:      Old ← Mem[x]    New ← Old + 1    Mem[x] ← New
(time →)

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location address
– Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
– Read the CUDA C Programming Guide 4.0 or later for details

– Atomic Add
int atomicAdd(int* address, int val);
– reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add
unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

– Single-precision floating-point atomic add (compute capability >= 2.0)
float atomicAdd(float* address, float val);

6
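As a quick illustration (not from the slides), the overload is selected by the pointer type, and every form returns the value that was in memory before the addition; the variable names below are invented for the example.

__global__ void atomic_add_examples(int *i_total, unsigned long long int *ull_total, float *f_total) {
    // 32-bit signed add: old_i receives the value of *i_total before the add.
    int old_i = atomicAdd(i_total, 1);

    // 64-bit unsigned add.
    unsigned long long int old_ull = atomicAdd(ull_total, 1ULL);

    // Single-precision floating-point add (compute capability >= 2.0).
    float old_f = atomicAdd(f_total, 0.5f);

    // The returned old values can be used, e.g., to claim a unique slot per thread.
    (void)old_i; (void)old_ull; (void)old_f;
}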

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
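For context, a minimal host-side driver for this kernel could look like the sketch below; it is not part of the slides, and the 256-bin layout and the 32-block / 256-thread launch configuration are arbitrary choices for the example.

#define NUM_BINS 256   // assumed: one bin per possible byte value

void run_histogram(unsigned char *h_buffer, long size, unsigned int *h_histo) {
    unsigned char *d_buffer;
    unsigned int  *d_histo;

    cudaMalloc((void **)&d_buffer, size);
    cudaMalloc((void **)&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));  // bins must start at 0

    // The interleaved (strided) access pattern inside the kernel keeps loads coalesced.
    histo_kernel<<<32, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
}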

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
– Latency and throughput of atomic operations
– Atomic operations on global memory
– Atomic operations on shared L2 cache
– Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each Read-Modify-Write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency, then DRAM write latency) is followed by atomic operation N+1 (DRAM read latency, then DRAM write latency)]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
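As a rough worked example (the numbers are assumed for illustration, not taken from the slides): with a 1000-cycle read-modify-write latency and a 1 GHz clock, one contended location sustains at most 1 / (1000 × 1 ns) = 10^6 atomic operations per second, i.e. about 4 MB/s of 4-byte counter updates, while the peak bandwidth of a single memory channel is typically tens of GB/s.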

5

6

You may have had a similar experience in a supermarket checkout
– Some customers realize that they missed an item after they have started to check out
– They run to the aisle and get the item while the line waits
– The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts to check out before they even fetch any of the items
– The rate of checkout will be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements – Atomic operations on Fermi L2 cache
– Medium latency, about 1/10 of the DRAM latency
– Shared among all blocks
– "Free improvement" on global memory atomics

[Timeline: atomic operation N (L2 read latency, then L2 write latency) is followed by atomic operation N+1 (L2 read latency, then L2 write latency)]

8

Hardware Improvements – Atomic operations on Shared Memory
– Very short latency
– Private to each thread block
– Needs algorithm work by programmers (more later)

[Timeline: atomic operation N is followed immediately by atomic operation N+1]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
– Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
– A high-performance privatized histogram kernel
– Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Diagram: all blocks (Block 0 … Block N) perform their atomic updates directly on a single final copy: heavy contention and serialization. With privatization, each block instead updates its own private copy (Copy 0 … Copy N), which is later merged into the final copy.]

4

Privatization (cont.)

[Diagram (continued): each block (Block 0 … Block N) performs its atomic updates on its own private copy (Copy 0 … Copy N): much less contention and serialization.]

5

Privatization (cont.)

[Diagram (continued): the private copies (Copy 0 … Copy N) are then accumulated into the final copy: much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
– Overhead for creating and initializing private copies
– Overhead for accumulating the contents of private copies into the final copy
– Benefit
– Much less contention and serialization in accessing both the private copies and the final copy
– The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);

11
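Putting the three fragments together gives the complete privatized kernel sketched below. This is a reconstruction rather than an official listing: the kernel name is invented, and the a–z bounds check from the Module 7.3 kernel is added so that only letters index the 7 four-letter bins.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    // Per-block private copy of the 7-bin histogram in shared memory
    __shared__ unsigned int histo_private[7];

    // Initialize the private bin counters
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved (strided) accesses
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

    // Wait for the whole block, then merge the private copy into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}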

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications

– The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
– No privatization if the operation does not fit the requirement

– The private histogram size needs to be small
– It must fit into shared memory

– What if the histogram is too large to privatize?
– Sometimes one can partially privatize an output histogram and use range testing to send each update to either global memory or shared memory (see the sketch after this list)

12
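As an illustration of that last point (not from the slides), a partially privatized kernel might keep only the first PRIV_BINS bins in shared memory and route every other bin to the global histogram; PRIV_BINS, the kernel name, and the one-bin-per-byte-value layout are assumptions made for this example.

#define PRIV_BINS 64   // assumed: only the most frequently hit bins are privatized

__global__ void histo_partial_private(unsigned char *buffer, long size,
                                      unsigned int *histo, int num_bins)
{
    __shared__ unsigned int histo_s[PRIV_BINS];

    // Initialize the privatized portion of the histogram
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int bin = buffer[i];              // assumed: one bin per byte value
        if (bin < PRIV_BINS)
            atomicAdd(&histo_s[bin], 1);  // low bins: cheap shared memory atomic
        else if (bin < num_bins)
            atomicAdd(&histo[bin], 1);    // high bins: fall back to global memory
        i += stride;
    }
    __syncthreads();

    // Merge the privatized bins into the global histogram
    for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);
}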

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

[Chart data for the text histogram example: a-d: 5, e-h: 5, i-l: 6, m-p: 10, q-t: 10, u-x: 1, y-z: 1]
Page 54: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 55: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 56: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability ≥ 2.0)
  float atomicAdd(float* address, float val);

6
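A small sketch of how the different overloads are picked by argument type (the kernel and variable names below are ours; only the signatures above come from the deck):

__global__ void accumulate(int *int_sum,
                           unsigned long long int *ull_sum,
                           float *float_sum)
{
    int old_i = atomicAdd(int_sum, 1);   // 32-bit signed overload; returns the old value
    atomicAdd(ull_sum, 1ULL);            // unsigned 64-bit overload
    atomicAdd(float_sum, 0.5f);          // float overload (compute capability >= 2.0)
    (void)old_i;                         // old value is available if the caller needs it
}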

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
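For completeness, a minimal host-side sketch of how this kernel might be driven (our code, assuming histo_kernel from the slide above is defined in the same file; the 4x64 launch configuration and the 7-bin output are arbitrary illustrative choices):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Uppercase letters fall outside 'a'..'z' and are skipped by the kernel's range test
    const char *text = "Programming Massively Parallel Processors";
    long size = (long)strlen(text);
    unsigned int hist[7] = {0};

    unsigned char *d_buf;
    unsigned int  *d_hist;
    cudaMalloc((void**)&d_buf, size);
    cudaMalloc((void**)&d_hist, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buf, text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, 7 * sizeof(unsigned int));

    histo_kernel<<<4, 64>>>(d_buf, size, d_hist);   // interleaved traversal, atomic updates as above
    cudaMemcpy(hist, d_hist, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < 7; ++b)
        printf("bin %d: %u\n", b, hist[b]);

    cudaFree(d_buf);
    cudaFree(d_hist);
    return 0;
}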

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
– Latency and throughput of atomic operations
– Atomic operations on global memory
– Atomic operations on the shared L2 cache
– Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no other thread can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N occupies a DRAM read latency plus a DRAM write latency, and only then can atomic operation N+1 start its own read and write.]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute atomic operations on that location
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel

5
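As a rough, illustrative calculation (the 1 GHz clock and the 1000-cycle latency are assumed round numbers, not figures from the deck): if every atomic operation on one location must wait out the full read-modify-write latency, then

    throughput ≈ 1 / t_RMW = 1 / (1000 cycles × 1 ns per cycle) ≈ 10^6 atomic operations per second

on that single location, which is orders of magnitude below what the memory channel could otherwise sustain.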

6

You may have had a similar experience at a supermarket checkout
– Some customers realize that they missed an item after they have started to check out
– They run to the aisle and get the item while the line waits
– The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before fetching any of the items
– The rate of checkout would be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline: atomic operation N and atomic operation N+1 each take an L2 read latency plus an L2 write latency, back to back.]

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Needs algorithmic work by programmers (more later)

[Timeline: atomic operation N followed almost immediately by atomic operation N+1.]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
– Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
– A high-performance privatized histogram kernel
– Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Diagram: without privatization, Blocks 0 … N all perform atomic updates directly on one final copy; with privatization, each block updates its own private copy (Copy 0 … Copy N), which are later merged into the final copy. The direct scheme is labeled "Heavy contention and serialization".]

4

Privatization (cont.)

[Same diagram, now highlighting the privatized scheme: much less contention and serialization.]

5

Privatization (cont.)

[Same diagram as the previous slide: much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

9

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

11
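Putting the three fragments together, a complete privatized kernel might look like the following minimal sketch. The assembly and the kernel name histo_privatized_kernel are ours; we also reuse the letter-position range test from the Module 7.3 kernel rather than the raw buffer[i]/4 indexing of the fragment above.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory (7 bins, as in the slides);
    // assumes at least 7 threads per block
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (coalesced) traversal of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);   // cheap shared-memory atomic
        i += stride;
    }
    __syncthreads();

    // One atomic per bin per block to merge into the global histogram
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}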

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows this slide)

12
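As one way to read that last bullet, here is a hedged sketch (ours, not from the deck) of partial privatization for a histogram with many bins: only the first SHARED_BINS bins are privatized in shared memory, and the range test routes other updates straight to global memory. The input type, the names, and the 1024-bin cutoff are all assumptions.

__global__ void histo_partial_private(unsigned int *input, long size,
                                      unsigned int *histo, int num_bins)
{
    const int SHARED_BINS = 1024;                  // how many low-numbered bins fit in shared memory (assumed)
    __shared__ unsigned int histo_s[SHARED_BINS];

    // Cooperatively clear the privatized portion
    for (int b = threadIdx.x; b < SHARED_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = input[i];               // assumed to be < num_bins
        if (bin < SHARED_BINS)
            atomicAdd(&histo_s[bin], 1);           // privatized bins: shared-memory atomic
        else
            atomicAdd(&histo[bin], 1);             // remaining bins: global (L2-backed) atomic
        i += stride;
    }
    __syncthreads();

    // Merge the privatized bins back into the global histogram
    for (int b = threadIdx.x; b < SHARED_BINS && b < num_bins; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);
}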

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Page 57: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 58: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 59: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

HistogrammingModule 71 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn the parallel histogram computation pattern

ndash An important useful computationndash Very different from all the patterns we have covered so far in terms of

output behavior of each threadndash A good starting point for understanding output interference in parallel

computation

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block
– Initialize the bin counters in the private copies of histo[]

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
  // each block gets a private 7-bin histogram in shared memory
  __shared__ unsigned int histo_private[7];
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

10

Build Private Histogram

  int i = threadIdx.x + blockIdx.x * blockDim.x;
  // stride is the total number of threads
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    atomicAdd(&(histo_private[buffer[i]/4]), 1);
    i += stride;
  }

10

11

Build Final Histogram

  // wait for all other threads in the block to finish
  __syncthreads();

  if (threadIdx.x < 7) {
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
  }
}

11
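Putting the three pieces together, the following is a minimal end-to-end sketch of the privatized text-histogram kernel. It merges the fragments above with the range test from the basic kernel of Module 7.3; the 7 bins of 4 letters follow the slides, but the assembled form and the kernel name are ours for illustration rather than taken verbatim from the kit.

__global__ void histo_privatized_kernel(unsigned char *buffer, long size,
                                        unsigned int *histo)
{
  // One private 7-bin histogram per thread block, held in shared memory
  __shared__ unsigned int histo_private[7];
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();

  // Interleaved (coalesced) traversal of the input
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position / 4]), 1);  // low-contention shared memory atomic
    i += stride;
  }
  __syncthreads();

  // Merge this block's private histogram into the global result
  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}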

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
– There is no privatization if the operation does not fit the requirement
– The private histogram size needs to be small
– It must fit into shared memory
– What if the histogram is too large to privatize?
– Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
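The slides do not give code for this, so the following is only a sketch of partial privatization with range testing, under assumed names and an assumed shared-memory budget (PRIV_BINS): bins below PRIV_BINS are privatized in shared memory, while the rarer high-numbered bins are updated directly in global memory.

#define PRIV_BINS 1024   // assumed number of bins that fits in shared memory

__global__ void partial_histo_kernel(unsigned int *input, long size,
                                     unsigned int *histo, unsigned int num_bins)
{
  __shared__ unsigned int histo_s[PRIV_BINS];
  for (unsigned int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    histo_s[b] = 0;
  __syncthreads();

  long i = threadIdx.x + blockIdx.x * blockDim.x;
  long stride = blockDim.x * gridDim.x;
  while (i < size) {
    unsigned int bin = input[i];
    if (bin < PRIV_BINS)
      atomicAdd(&histo_s[bin], 1);   // range test: privatized bin, shared memory atomic
    else if (bin < num_bins)
      atomicAdd(&histo[bin], 1);     // range test: non-privatized bin, global memory atomic
    i += stride;
  }
  __syncthreads();

  // Merge the privatized portion into the global histogram
  for (unsigned int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
    if (histo_s[b] > 0) atomicAdd(&histo[b], histo_s[b]);
}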

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

[Chart data for the text histogram example ("Programming Massively Parallel Processors"): a-d: 5, e-h: 5, i-l: 6, m-p: 10, q-t: 10, u-x: 1, y-z: 1]

3

Read-modify-write in the Text Histogram Example

– For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

– For example, multiple bank tellers count the total amount of cash in the safe
– Each grabs a pile and counts it
– There is a central display of the running total
– Whenever someone finishes counting a pile, they read the current running total (read) and add the subtotal of the pile to the running total (modify-write)
– A bad outcome
– Some of the piles were not accounted for in the final total

5

A Common Parallel Service Pattern
– For example, multiple customer service agents serving waiting customers
– The system maintains two numbers
– the number to be given to the next incoming customer (I)
– the number for the customer to be served next (S)
– The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)
– A central display shows the number for the customer to be served next
– When an agent becomes available, he/she calls the number (read S) and increments the display number by 1 (modify-write S)
– Bad outcomes
– Multiple customers receive the same number; only one of them receives service
– Multiple agents serve the same number

6

A Common Arbitration Pattern
– For example, multiple customers booking airline tickets in parallel
– Each
– Brings up a flight seat map (read)
– Decides on a seat
– Updates the seat map and marks the selected seat as taken (modify-write)
– A bad outcome
– Multiple passengers end up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables.

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:              thread2:
Old ← Mem[x]          Old ← Mem[x]
New ← Old + 1         New ← Old + 1
Mem[x] ← New          Mem[x] ← New
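In CUDA terms, the same unprotected update looks like the sketch below (a deliberately racy illustration, not code from the kit); when many threads run it on the same x, updates can be lost exactly as in the timing scenarios that follow.

__global__ void unsafe_add_one(int *x)
{
  int Old = *x;        // Old ← Mem[x]   (read)
  int New = Old + 1;   // New ← Old + 1  (modify)
  *x = New;            // Mem[x] ← New   (write) – may overwrite another thread's update
}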

8

Timing Scenario 1

Time | Thread 1           | Thread 2
1    | (0) Old ← Mem[x]   |
2    | (1) New ← Old + 1  |
3    | (1) Mem[x] ← New   |
4    |                    | (1) Old ← Mem[x]
5    |                    | (2) New ← Old + 1
6    |                    | (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time | Thread 1           | Thread 2
1    |                    | (0) Old ← Mem[x]
2    |                    | (1) New ← Old + 1
3    |                    | (1) Mem[x] ← New
4    | (1) Old ← Mem[x]   |
5    | (2) New ← Old + 1  |
6    | (2) Mem[x] ← New   |

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time | Thread 1           | Thread 2
1    | (0) Old ← Mem[x]   |
2    | (1) New ← Old + 1  |
3    |                    | (0) Old ← Mem[x]
4    | (1) Mem[x] ← New   |
5    |                    | (1) New ← Old + 1
6    |                    | (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time | Thread 1           | Thread 2
1    |                    | (0) Old ← Mem[x]
2    |                    | (1) New ← Old + 1
3    | (0) Old ← Mem[x]   |
4    |                    | (1) Mem[x] ← New
5    | (1) New ← Old + 1  |
6    | (1) Mem[x] ← New   |

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

Either thread1 completes its entire read-modify-write sequence
  (Old ← Mem[x], New ← Old + 1, Mem[x] ← New)
before thread2 starts its own,

Or

thread2 completes its entire read-modify-write sequence before thread1 starts its own.

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
– Atomic operation concepts
– Types of atomic operations in CUDA
– Intrinsic functions
– A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

[Figure: Mem[x] is initialized to 0; along the time axis, thread1 and thread2 each execute Old ← Mem[x], New ← Old + 1, Mem[x] ← New, with the two sequences overlapping in time]

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
– Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
– Read the CUDA C Programming Guide 4.0 or later for details
– Atomic add:
  int atomicAdd(int* address, int val);
– Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old (see the sketch below).
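Because atomicAdd returns the old value, it can do more than count: the returned value can serve as a unique slot index. The fragment below is a small illustration of that idea (the kernel, queue, and counter names are ours, not from the kit).

__global__ void enqueue_positives(const int *data, int n,
                                  int *queue, int *queue_count)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n && data[i] > 0) {
    int slot = atomicAdd(queue_count, 1);  // old count = this thread's private slot
    queue[slot] = data[i];                 // no two threads get the same slot
  }
}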

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add:
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add:
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 or higher):
  float atomicAdd(float* address, float val);
A small usage sketch follows.
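For instance, the single-precision overload can accumulate a global sum directly (a minimal illustration under assumed names, not kit code; on heavily contended data a privatized or tree-based reduction would be preferred):

__global__ void sum_kernel(const float *data, int n, float *sum)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
    atomicAdd(sum, data[i]);   // float atomicAdd, compute capability 2.0+
}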

6

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride is the total number of threads
  int stride = blockDim.x * gridDim.x;

  // All threads handle blockDim.x * gridDim.x consecutive elements
  while (i < size) {
    atomicAdd(&(histo[buffer[i]]), 1);
    i += stride;
  }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern, mapping each letter to one of seven 4-letter bins (a host-side launch sketch follows)

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // stride is the total number of threads
  int stride = blockDim.x * gridDim.x;

  // All threads handle blockDim.x * gridDim.x consecutive elements
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo[alphabet_position/4]), 1);
    i += stride;
  }
}
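For completeness, a minimal host-side driver sketch for this kernel (allocation sizes follow the 7-bin example; the 8x256 launch configuration and the lowercase input are arbitrary choices for illustration, not part of the kit):

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
  const char *text = "programming massively parallel processors";
  long size = (long)strlen(text);
  unsigned int h_histo[7] = {0};

  unsigned char *d_buffer;
  unsigned int *d_histo;
  cudaMalloc((void**)&d_buffer, size);
  cudaMalloc((void**)&d_histo, 7 * sizeof(unsigned int));
  cudaMemcpy(d_buffer, text, size, cudaMemcpyHostToDevice);
  cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

  histo_kernel<<<8, 256>>>(d_buffer, size, d_histo);

  cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
  for (int b = 0; b < 7; ++b)
    printf("bin %d: %u\n", b, h_histo[b]);

  cudaFree(d_buffer);
  cudaFree(d_histo);
  return 0;
}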

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
– Latency and throughput of atomic operations
– Atomic operations on global memory
– Atomic operations on the shared L2 cache
– Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency, then DRAM write latency) followed by atomic operation N+1 (DRAM read latency, then DRAM write latency)]

5

Latency determines throughput
– The throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations

y-z
Page 61: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Histogramndash A method for extracting notable features and patterns from large data

setsndash Feature extraction for object recognition in imagesndash Fraud detection in credit card transactionsndash Correlating heavenly object movements in astrophysicsndash hellip

ndash Basic histograms - for each element in the data set use the value to identify a ldquobin counterrdquo to increment

3

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 62: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

A Text Histogram Examplendash Define the bins as four-letter sections of the alphabet a-d e-h i-l n-

p hellipndash For each character in an input string increment the appropriate bin

counterndash In the phrase ldquoProgramming Massively Parallel Processorsrdquo the

output histogram is shown below

4

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 63: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Chart1

Series 1
5
5
6
10
10
1
1

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add

unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

– Single-precision floating-point atomic add (compute capability 2.0 and above)

float atomicAdd(float* address, float val);
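As a small usage sketch (not from the original slides; the kernel name and the reduction-style example are illustrative), the floating-point overload can accumulate per-thread contributions into a single global sum:

__global__ void sum_kernel(const float *data, long size, float *sum)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < size) {
        // float overload of atomicAdd (compute capability 2.0 and above)
        atomicAdd(sum, data[i]);
    }
}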

6

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
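For completeness, a host-side driver for this kernel might look roughly like the sketch below. It is not part of the original slides; the helper function name and the grid and block dimensions are illustrative assumptions.

// Hypothetical host-side driver; h_buffer/size hold the input text on the host.
void run_histogram(const unsigned char *h_buffer, long size, unsigned int h_histo[7])
{
    unsigned char *d_buffer;
    unsigned int  *d_histo;

    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

    // Illustrative launch configuration: 128 blocks of 256 threads
    histo_kernel<<<128, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_buffer);
    cudaFree(d_histo);
}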

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each Read-Modify-Write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency, then DRAM write latency) is followed by atomic operation N+1 (DRAM read latency, then DRAM write latency); time flows left to right]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
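A rough back-of-the-envelope illustration (the specific numbers are assumptions, not from the slides): at a 1 GHz clock, a 1000-cycle read-modify-write sequence on one contended location allows only about 10^9 / 1000 = 1 million atomic operations per second on that location, orders of magnitude below the peak access rate a memory channel could otherwise sustain.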

5

6

You may have had a similar experience at a supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout would be 1 / (entire shopping time of each customer)

6

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline: atomic operation N (L2 read latency, then L2 write latency) is followed by atomic operation N+1 (L2 read latency, then L2 write latency); time flows left to right]

8

Hardware Improvements
– Atomic operations on Shared Memory
  – Very short latency
  – Private to each thread block
  – Needs algorithm work by programmers (more later)

[Timeline: atomic operation N is followed almost immediately by atomic operation N+1; time flows left to right]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

2

3

Privatization

[Figure, left: Block 0, Block 1, …, Block N all perform atomic updates directly on a single Final Copy – heavy contention and serialization. Right: each block performs atomic updates on its own private copy (Copy 0, Copy 1, …, Copy N), which are then merged into the Final Copy.]

4

Privatization (cont.)

[Figure, right side emphasized: each block performs atomic updates on its own private copy (Copy 0, Copy 1, …, Copy N), which are then merged into the Final Copy – much less contention and serialization.]

5

Privatization (cont.)

[Figure repeated from the previous slide: private per-block copies with atomic updates, merged into the Final Copy – much less contention and serialization.]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

8

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        // assumes buffer[] holds alphabet positions 0–25
        // (see the earlier kernel for the 'a' offset and bounds check)
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

10

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
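Putting the pieces from slides 8–11 together, the complete privatized kernel reads roughly as follows (assembled from the fragments above; only the private-array naming is normalized):

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved (coalesced) accesses
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;            // total number of threads
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);  // assumes values 0-25
        i += stride;
    }

    // Merge the private histogram into the global one
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}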

11

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – Fits into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch below)
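A minimal sketch of that partial-privatization idea (not from the slides; PRIVATE_BINS, the kernel name, and the assumption that bin indices are precomputed are all illustrative):

#define PRIVATE_BINS 256   // hypothetical: portion of the histogram kept in shared memory

__global__ void partial_private_histo(const unsigned int *bins, long size,
                                      unsigned int *histo /* full-size global histogram */)
{
    __shared__ unsigned int histo_s[PRIVATE_BINS];
    for (int b = threadIdx.x; b < PRIVATE_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = bins[i];          // assumed precomputed bin index
        if (bin < PRIVATE_BINS)
            atomicAdd(&histo_s[bin], 1);     // range test: privatized (shared memory) portion
        else
            atomicAdd(&histo[bin], 1);       // everything else goes to global memory atomics
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < PRIVATE_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);    // merge the privatized portion
}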

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
a-d
e-h
i-l
m-p
q-t
u-x
y-z
Page 64: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Sheet1

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Series 1
a-d 5
e-h 5
i-l 6
m-p 10
q-t 10
u-x 1
y-z 1
Page 65: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

55

A simple parallel histogram algorithm

ndash Partition the input into sectionsndash Have each thread to take a section of the inputndash Each thread iterates through its sectionndash For each letter increment the appropriate bin counter

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 66: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

Sectioned Partitioning (Iteration 1)

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 67: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Sectioned Partitioning (Iteration 2)

7

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1: If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed?

Question 2: What does each thread get in its Old variable?

Unfortunately, the answers may vary according to the relative execution timing between the two threads, which is referred to as a data race.

thread1:                  thread2:
Old ← Mem[x]              Old ← Mem[x]
New ← Old + 1             New ← Old + 1
Mem[x] ← New              Mem[x] ← New
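
The same race is easy to reproduce on a GPU. The following is a minimal, hypothetical CUDA sketch (kernel and variable names are illustrative, not from the slides) in which every thread performs the unprotected read-modify-write shown above on one global counter; the final value depends on thread timing and is almost always far smaller than the number of threads:

// Illustration of the read-modify-write data race (no atomics).
__global__ void racy_increment(unsigned int *counter)
{
    unsigned int old_val = *counter;      // read
    unsigned int new_val = old_val + 1;   // modify
    *counter = new_val;                   // write (may overwrite other threads' updates)
}

// Host-side usage sketch:
//   unsigned int *d_counter;
//   cudaMalloc(&d_counter, sizeof(unsigned int));
//   cudaMemset(d_counter, 0, sizeof(unsigned int));
//   racy_increment<<<256, 256>>>(d_counter);   // 65,536 attempted increments
//   // The value copied back is typically much smaller than 65,536.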

8

Timing Scenario 1

Time  Thread 1             Thread 2
1     (0) Old ← Mem[x]
2     (1) New ← Old + 1
3     (1) Mem[x] ← New
4                          (1) Old ← Mem[x]
5                          (2) New ← Old + 1
6                          (2) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 1
– Mem[x] = 2 after the sequence

9

Timing Scenario 2

Time  Thread 1             Thread 2
1                          (0) Old ← Mem[x]
2                          (1) New ← Old + 1
3                          (1) Mem[x] ← New
4     (1) Old ← Mem[x]
5     (2) New ← Old + 1
6     (2) Mem[x] ← New

– Thread 1: Old = 1
– Thread 2: Old = 0
– Mem[x] = 2 after the sequence

10

Timing Scenario 3

Time  Thread 1             Thread 2
1     (0) Old ← Mem[x]
2     (1) New ← Old + 1
3                          (0) Old ← Mem[x]
4     (1) Mem[x] ← New
5                          (1) New ← Old + 1
6                          (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

11

Timing Scenario 4

Time  Thread 1             Thread 2
1                          (0) Old ← Mem[x]
2                          (1) New ← Old + 1
3     (0) Old ← Mem[x]
4                          (1) Mem[x] ← New
5     (1) New ← Old + 1
6     (1) Mem[x] ← New

– Thread 1: Old = 0
– Thread 2: Old = 0
– Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

Either thread1 completes its entire read-modify-write before thread2 starts:

thread1: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread2: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Or thread2 completes its entire read-modify-write before thread1 starts:

thread2: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New
thread1: Old ← Mem[x];  New ← Old + 1;  Mem[x] ← New

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

– Both threads receive 0 in Old
– Mem[x] becomes 1

Mem[x] initialized to 0; time runs downward:

thread1                   thread2
Old ← Mem[x]
                          Old ← Mem[x]
New ← Old + 1
Mem[x] ← New
                          New ← Old + 1
                          Mem[x] ← New

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other threads can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
  – All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
– Read the CUDA C Programming Guide 4.0 or later for details

– Atomic Add

int atomicAdd(int* address, int val);

  – reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
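
As a small illustration of that return value (this example is not from the slides; the kernel and variable names are hypothetical), the old value returned by atomicAdd can serve as a unique output slot when threads compact results:

__global__ void count_and_place(const int *in, int n, int *out, int *count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && in[i] > 0) {
        // atomicAdd returns the value of *count before the addition,
        // which gives this thread a unique index into out[].
        int slot = atomicAdd(count, 1);
        out[slot] = in[i];
    }
}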

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int* address, unsigned int val);

– Unsigned 64-bit integer atomic add

unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);

– Single-precision floating-point atomic add (compute capability 2.0 and later)

float atomicAdd(float* address, float val);
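
For example, the floating-point variant can accumulate a global sum. The kernel below is an illustrative sketch (not part of the slides) that adds every element of an array into a single total using the same strided loop as the histogram kernels; it is correct on compute capability 2.0 and later, but heavy contention on one address makes it slow, which is exactly the problem the privatization technique in Module 7.5 addresses:

__global__ void sum_kernel(const float *data, long size, float *total)
{
    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(total, data[i]);   // float atomicAdd (compute capability >= 2.0)
        i += stride;
    }
}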

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position/4]), 1);
        i += stride;
    }
}
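
A host-side driver for this kernel might look like the sketch below; the function name, launch configuration, and the omission of error handling are illustrative assumptions, not taken from the slides. Note that grouping a–z four letters per bin requires 7 counters:

#include <cuda_runtime.h>

void run_histogram(const unsigned char *h_buffer, long size, unsigned int h_histo[7])
{
    unsigned char *d_buffer;
    unsigned int *d_histo;

    cudaMalloc(&d_buffer, size);
    cudaMalloc(&d_histo, 7 * sizeof(unsigned int));

    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));   // bin counters must start at 0

    // Illustrative launch configuration; the strided loop in the kernel
    // handles any input size regardless of the grid dimensions chosen here.
    histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_buffer);
    cudaFree(d_histo);
}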

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on shared L2 cache
  – Atomic operations on shared memory

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no one else can access the location

4

Atomic Operations on DRAM
– Each Read-Modify-Write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency, then DRAM write latency) completes before atomic operation N+1 (DRAM read latency, then DRAM write latency) begins]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation

– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations

– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to < 1/1000 of the peak bandwidth of one memory channel
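
As a rough back-of-envelope illustration (the 1 GHz clock here is an assumed number, not one from the slides): if each serialized read-modify-write on a location takes about 1000 cycles, then

  throughput ≈ 10^9 cycles/s ÷ 1000 cycles per atomic = 10^6 atomic updates per second

on that one location, which is orders of magnitude below what the memory channel could otherwise deliver.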

6

You may have a similar experience in supermarket checkout

– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of the checkout will be 1 / (entire shopping time of each customer)

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on Global Memory atomics

[Timeline: atomic operation N and atomic operation N+1 are still serialized, but each read and write now pays an L2 latency instead of a DRAM latency]

8

Hardware Improvements
– Atomic operations on Shared Memory
  – Very short latency
  – Private to each thread block
  – Need algorithm work by programmers (more later)

[Timeline: atomic operation N and atomic operation N+1 run back to back, each with very short latency]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high performance privatized histogram kernel
  – Practical example of using shared memory and L2 cache atomic operations

3

Privatization

[Figure: left – Block 0, Block 1, …, Block N all make atomic updates directly to a single Final Copy; right – each block makes atomic updates to its own private Copy 0, Copy 1, …, Copy N, which are then combined into the Final Copy]

Heavy contention and serialization

4

Privatization (cont)

[Figure: same two layouts; the privatized version routes each block's atomic updates to its own Copy 0, Copy 1, …, Copy N before a final accumulation into the Final Copy]

Much less contention and serialization

5

Privatization (cont)

[Figure: repeated from the previous slide – per-block private copies with a final accumulation into the Final Copy]

Much less contention and serialization

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];
    ...

9

Shared Memory Atomics Requires Privatization

– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copies of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();
    ...

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;

    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7) {
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }
}
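
Assembled into a single kernel, a minimal sketch of the privatized histogram might look like the following; the only additions beyond the fragments above are the kernel name, which is illustrative, and the a-to-z range check carried over from the Module 7.3 kernel:

__global__ void histo_privatized_kernel(unsigned char *buffer,
                                        long size, unsigned int *histo)
{
    // Per-block private histogram in shared memory (7 bins: a-d, e-h, ..., y-z)
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Build the private histogram with interleaved (coalesced) accesses
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position/4]), 1);
        i += stride;
    }

    // Wait for the whole block, then accumulate into the global histogram
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}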

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications

– The operation needs to be associative and commutative
  – Histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement

– The private histogram size needs to be small
  – Fits into shared memory

– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (see the sketch after this list)
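
The last point, partial privatization with range testing, could be realized along the following lines. This is only an illustrative sketch under the assumption that a known "hot" range of bins fits in shared memory; the kernel name, parameters, and the one-bin-per-byte-value layout of histo[] are all assumptions, not from the slides:

__global__ void histo_partial_private(unsigned char *buffer, long size,
                                      unsigned int *histo,           // full histogram in global memory (one bin per byte value)
                                      int hot_begin, int hot_count)  // range of bins privatized in shared memory
{
    // Dynamically sized shared array; launch with hot_count * sizeof(unsigned int)
    // as the third launch parameter: kernel<<<grid, block, hot_count * sizeof(unsigned int)>>>(...)
    extern __shared__ unsigned int hot_private[];

    for (int b = threadIdx.x; b < hot_count; b += blockDim.x)
        hot_private[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int bin = buffer[i];
        if (bin >= hot_begin && bin < hot_begin + hot_count)
            atomicAdd(&hot_private[bin - hot_begin], 1);   // shared memory atomic (hot bins)
        else
            atomicAdd(&histo[bin], 1);                     // global memory atomic (cold bins)
        i += stride;
    }

    __syncthreads();
    for (int b = threadIdx.x; b < hot_count; b += blockDim.x)
        atomicAdd(&histo[hot_begin + b], hot_private[b]);
}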

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License

Page 68: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

8

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 69: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

9

Input Partitioning Affects Memory Access Efficiency

ndash Sectioned partitioning results in poor memory access efficiencyndash Adjacent threads do not access adjacent memory locationsndash Accesses are not coalescedndash DRAM bandwidth is poorly utilized

ndash Change to interleaved partitioningndash All threads process a contiguous section of elements ndash They all move to the next section and repeatndash The memory accesses are coalesced

1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 70: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

10

Interleaved Partitioning of Inputndash For coalescing and better memory access performance

hellip

11

Interleaved Partitioning (Iteration 2)

hellip

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Introduction to Data RacesModule 72 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To understand data races in parallel computing

ndash Data races can occur when performing read-modify-write operationsndash Data races can cause errors that are hard to reproducendash Atomic operations are designed to eliminate such data races

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 75: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Read-modify-write in the Text Histogram Example

ndash For coalescing and better memory access performance

hellip

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations – To Ensure Good Outcomes

Thread 1                    Thread 2
Old ← Mem[x]
New ← Old + 1
Mem[x] ← New
                            Old ← Mem[x]
                            New ← Old + 1
                            Mem[x] ← New

Or

Thread 1                    Thread 2
                            Old ← Mem[x]
                            New ← Old + 1
                            Mem[x] ← New
Old ← Mem[x]
New ← Old + 1
Mem[x] ← New

– An atomic operation forces one thread's entire read-modify-write sequence to complete before the other thread's sequence begins; either order is acceptable, and both yield Mem[x] = 2

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDA
Module 7.3 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
  – Atomic operation concepts
  – Types of atomic operations in CUDA
  – Intrinsic functions
  – A basic histogram kernel

3

Data Race Without Atomic Operations

Mem[x] initialized to 0; time runs downward

Thread 1              Thread 2
Old ← Mem[x]
                      Old ← Mem[x]
New ← Old + 1
                      New ← Old + 1
Mem[x] ← New
                      Mem[x] ← New

– Both threads receive 0 in Old
– Mem[x] becomes 1

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
  – Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
  – Any other threads that attempt to perform an atomic operation on the same location are typically held in a queue
  – All threads perform their atomic operations on the same location serially

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
  – Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare-and-swap)
  – See the CUDA C Programming Guide 4.0 or later for details
– Atomic add
    int atomicAdd(int* address, int val);
  – Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
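As a conceptual illustration (not from the slides, and not how the hardware implements it), the reference semantics of atomicAdd are the familiar three-step sequence, with the hardware guaranteeing that the whole sequence is indivisible:

// Sequential reference semantics of atomicAdd(address, val).
// On the GPU the read, add, and write below happen as one indivisible operation.
int atomicAdd_reference(int *address, int val) {
    int old = *address;     // read
    *address = old + val;   // modify and write
    return old;             // the previous value is returned to the caller
}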

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
    unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability 2.0 and above)
    float atomicAdd(float* address, float val);
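A quick usage sketch (not from the slides; the kernel and pointer names are made up) showing how the overload is selected by the argument types:

__global__ void atomic_add_examples(int *i_ctr, unsigned int *u_ctr,
                                    unsigned long long int *ull_ctr, float *f_ctr)
{
    int old_i = atomicAdd(i_ctr, 1);                            // 32-bit signed add
    unsigned int old_u = atomicAdd(u_ctr, 1u);                  // 32-bit unsigned add
    unsigned long long int old_ull = atomicAdd(ull_ctr, 1ull);  // 64-bit unsigned add
    float old_f = atomicAdd(f_ctr, 1.0f);                       // float add (compute capability >= 2.0)
    (void)old_i; (void)old_u; (void)old_ull; (void)old_f;       // returned old values unused here
}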

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}
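A host-side launch sketch (not part of the slides; the grid and block sizes, the helper name, and the 256-bin layout for raw byte values are illustrative assumptions):

#include <cuda_runtime.h>

void run_histogram(const unsigned char *h_buffer, long size, unsigned int h_histo[256])
{
    unsigned char *d_buffer; unsigned int *d_histo;
    cudaMalloc((void**)&d_buffer, size);
    cudaMalloc((void**)&d_histo, 256 * sizeof(unsigned int));
    cudaMemcpy(d_buffer, h_buffer, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));   // bin counters must start at zero

    // A modest fixed grid; the strided loop inside the kernel covers the whole input
    histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

    cudaMemcpy(h_histo, d_histo, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_buffer); cudaFree(d_histo);
}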

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;

    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
  – Latency and throughput of atomic operations
  – Atomic operations on global memory
  – Atomic operations on the shared L2 cache
  – Atomic operations on shared memory

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no other thread can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized

[Timeline: atomic operation N (DRAM read latency + DRAM write latency), then atomic operation N+1 (DRAM read latency + DRAM write latency)]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
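For a rough sense of scale (an illustrative assumption, not a figure from the slides): suppose the device clock is 1 GHz and a contended read-modify-write on one DRAM location takes about 1000 cycles end to end. The serialized update rate on that single location is then at most

\[
\frac{10^{9}\ \text{cycles/s}}{10^{3}\ \text{cycles/update}} = 10^{6}\ \text{updates/s} \approx 4\ \text{MB/s of useful traffic for 4-byte counters},
\]

orders of magnitude below the tens of GB/s a memory channel can otherwise deliver.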

6

You may have a similar experience in supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
  – The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts the checkout before they even fetch any of the items
  – The rate of checkout would be 1 / (entire shopping time of each customer)

7

Hardware Improvements
– Atomic operations on Fermi L2 cache
  – Medium latency, about 1/10 of the DRAM latency
  – Shared among all blocks
  – "Free improvement" on global memory atomics

[Timeline: atomic operation N (L2 read latency + L2 write latency), then atomic operation N+1 (L2 read latency + L2 write latency)]

8

Hardware Improvements
– Atomic operations on shared memory
  – Very short latency
  – Private to each thread block
  – Need algorithm work by programmers (more later)

[Timeline: atomic operation N, then atomic operation N+1 – each far shorter than the DRAM or L2 case]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn to write a high-performance kernel by privatizing outputs
  – Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
  – A high-performance privatized histogram kernel
  – A practical example of using shared memory and L2 cache atomic operations

3

Privatization

[Diagram: Block 0, Block 1, …, Block N all perform atomic updates directly on a single Final Copy, shown next to the alternative where each block updates its own private Copy 0, Copy 1, …, Copy N that is later merged into the Final Copy. Caption: heavy contention and serialization for the direct-update scheme]

4

Privatization (cont.)

[Diagram: each of Block 0, Block 1, …, Block N performs atomic updates on its own private Copy 0, Copy 1, …, Copy N, which are then merged into the Final Copy – much less contention and serialization]

5

Privatization (cont.)

[Diagram repeated from the previous slide: per-block private copies merged into the Final Copy – much less contention and serialization]

6

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer,
                             long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copy of histo[]
    if (threadIdx.x < 7)
        histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is the total number of threads
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
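Putting the three fragments together, one plausible form of the complete privatized kernel (a sketch assembled from the slide fragments above; the kernel name is changed only to distinguish it from the basic version, and the 7-bin layout with the a–z bounds check follows the earlier text-histogram kernel):

__global__ void histo_privatized_kernel(unsigned char *buffer,
                                        long size, unsigned int *histo)
{
    // One private 7-bin histogram per thread block, held in shared memory
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (coalesced) traversal of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);    // fast shared-memory atomic
        i += stride;
    }

    // Merge this block's private histogram into the global one
    __syncthreads();
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]); // only 7 global atomics per block
}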

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory
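A hedged sketch of the range-testing idea just mentioned (not from the slides): privatize only the first PRIV_BINS bins in shared memory and fall back to global atomics for the rest. The constant, the names, and the assumption that the input is already a stream of bin indices are all illustrative.

#define PRIV_BINS 1024   // how many low-numbered bins fit comfortably in shared memory (assumption)

__global__ void histo_partially_private(const unsigned int *bin_idx, long n,
                                        unsigned int *histo, unsigned int num_bins)
{
    __shared__ unsigned int histo_s[PRIV_BINS];
    for (unsigned int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        histo_s[b] = 0;
    __syncthreads();

    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < n) {
        unsigned int b = bin_idx[i];
        if (b < PRIV_BINS)
            atomicAdd(&histo_s[b], 1);   // privatized range: cheap shared-memory atomic
        else if (b < num_bins)
            atomicAdd(&histo[b], 1);     // remaining range: global-memory atomic
        i += stride;
    }

    __syncthreads();
    for (unsigned int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);   // merge the privatized range into the global histogram
}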

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 76: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

Read-Modify-Write Used in Collaboration Patterns

ndash For example multiple bank tellers count the total amount of cash in the safe

ndash Each grab a pile and countndash Have a central display of the running totalndash Whenever someone finishes counting a pile read the current running

total (read) and add the subtotal of the pile to the running total (modify-write)

ndash A bad outcomendash Some of the piles were not accounted for in the final total

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 77: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

5

A Common Parallel Service Patternndash For example multiple customer service agents serving waiting customers ndash The system maintains two numbers

ndash the number to be given to the next incoming customer (I)ndash the number for the customer to be served next (S)

ndash The system gives each incoming customer a number (read I) and increments the number to be given to the next customer by 1 (modify-write I)

ndash A central display shows the number for the customer to be served nextndash When an agent becomes available heshe calls the number (read S) and

increments the display number by 1 (modify-write S)ndash Bad outcomes

ndash Multiple customers receive the same number only one of them receives servicendash Multiple agents serve the same number

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 78: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

6

A Common Arbitration Patternndash For example multiple customers booking airline tickets in parallelndash Each

ndash Brings up a flight seat map (read)ndash Decides on a seatndash Updates the seat map and marks the selected seat as taken (modify-

write)

ndash A bad outcomendash Multiple passengers ended up booking the same seat

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 79: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

7

Data Race in Parallel Thread Execution

Old and New are per-thread register variables

Question 1 If Mem[x] was initially 0 what would the value of Mem[x] be after threads 1 and 2 have completed

Question 2 What does each thread get in their Old variable

Unfortunately the answers may vary according to the relative execution timing between the two threads which is referred to as a data race

thread1 thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

8

Timing Scenario 1Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 1ndash Mem[x] = 2 after the sequence

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

Privatization (cont.)

(Figure: each of Blocks 0 through N instead performs its atomic updates on a private copy, Copy 0 through Copy N, and the private copies are then merged into the final copy, giving much less contention and serialization.)

Cost and Benefit of Privatization
– Cost
  – Overhead for creating and initializing private copies
  – Overhead for accumulating the contents of private copies into the final copy
– Benefit
  – Much less contention and serialization in accessing both the private copies and the final copy
  – The overall performance can often be improved by more than 10x

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention: only threads in the same block can access a given shared memory variable
– This is a very important use case for shared memory

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size,
                             unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copy of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

Build Private Histogram

    // stride is the total number of threads
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i]/4]), 1);
        i += stride;
    }

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
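For reference, here is a minimal end-to-end sketch assembling the pieces above into one compilable program. The alphabet-position range check is borrowed from the basic text histogram kernel of Module 7.3; the host-side input string, grid/block sizes, and the absence of error checking are illustrative assumptions, not part of the slides.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void histo_privatized(unsigned char *buffer, long size,
                                 unsigned int *histo)
{
    // One private 7-bin histogram per thread block
    __shared__ unsigned int histo_private[7];
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

    // Interleaved (coalesced) traversal of the input
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo_private[alphabet_position / 4]), 1);
        i += stride;
    }
    __syncthreads();

    // Merge this block's private histogram into the global one
    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}

int main()
{
    const char text[] = "programming massively parallel processors";
    long size = (long)sizeof(text) - 1;

    unsigned char *d_buf;
    unsigned int *d_histo;
    cudaMalloc((void **)&d_buf, size);
    cudaMalloc((void **)&d_histo, 7 * sizeof(unsigned int));
    cudaMemcpy(d_buf, text, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

    histo_privatized<<<4, 128>>>(d_buf, size, d_histo);

    unsigned int h_histo[7];
    cudaMemcpy(h_histo, d_histo, sizeof(h_histo), cudaMemcpyDeviceToHost);
    for (int b = 0; b < 7; ++b)
        printf("bin %d: %u\n", b, h_histo[b]);

    cudaFree(d_buf);
    cudaFree(d_histo);
    return 0;
}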

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
  – The histogram add operation is associative and commutative
  – No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
  – It must fit into shared memory
– What if the histogram is too large to privatize?
  – Sometimes one can partially privatize an output histogram and use range testing to direct each update to either global memory or shared memory (see the sketch below)
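One possible shape of that partial-privatization idea, as a hedged sketch rather than anything taken from the slides (NUM_BINS, PRIV, and the binning of the input are all hypothetical choices):

#define NUM_BINS 4096   // hypothetical large histogram
#define PRIV      512   // bins we can afford to keep in shared memory

__global__ void histo_partial_private(unsigned int *input, long size,
                                      unsigned int *histo)
{
    // Privatize only the first PRIV bins
    __shared__ unsigned int histo_s[PRIV];
    for (int b = threadIdx.x; b < PRIV; b += blockDim.x) histo_s[b] = 0;
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        unsigned int bin = input[i] % NUM_BINS;
        if (bin < PRIV)
            atomicAdd(&histo_s[bin], 1);   // range test: privatized bin in shared memory
        else
            atomicAdd(&histo[bin], 1);     // everything else goes to global memory / L2
        i += stride;
    }
    __syncthreads();

    // Merge the privatized bins back into the global histogram
    for (int b = threadIdx.x; b < PRIV; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);
}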

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License


Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 81: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

9

Timing Scenario 2Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (1) Mem[x] New

4 (1) Old Mem[x]

5 (2) New Old + 1

6 (2) Mem[x] New

ndash Thread 1 Old = 1ndash Thread 2 Old = 0ndash Mem[x] = 2 after the sequence

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 82: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

10

Timing Scenario 3Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 83: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

11

Timing Scenario 4Time Thread 1 Thread 2

1 (0) Old Mem[x]

2 (1) New Old + 1

3 (0) Old Mem[x]

4 (1) Mem[x] New

5 (1) New Old + 1

6 (1) Mem[x] New

ndash Thread 1 Old = 0ndash Thread 2 Old = 0ndash Mem[x] = 1 after the sequence

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 84: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

12

Purpose of Atomic Operations ndash To Ensure Good Outcomes

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

thread1

thread2 Old Mem[x]New Old + 1Mem[x] New

Old Mem[x]New Old + 1Mem[x] New

Or

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 85: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objective
– To learn to use atomic operations in parallel programming
– Atomic operation concepts
– Types of atomic operations in CUDA
– Intrinsic functions
– A basic histogram kernel

3

Data Race Without Atomic Operations
– Mem[x] is initialized to 0
– Each thread performs: Old ← Mem[x]; New ← Old + 1; Mem[x] ← New
– When thread1 and thread2 interleave their read-modify-write sequences, both threads receive 0 in Old
– Mem[x] becomes 1 instead of 2: one of the increments is lost
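The race above is easy to reproduce; a minimal sketch (not part of the original slides) in which 65,536 threads increment one counter, first with a plain ++ and then with the atomic operation introduced in the following slides:

    #include <cstdio>

    // (*counter)++ compiles into separate read, add, and write steps, so updates
    // from different threads can overwrite each other; atomicAdd performs the
    // same read-modify-write as one indivisible operation.
    __global__ void racy_increment(int *counter)   { (*counter)++; }
    __global__ void atomic_increment(int *counter) { atomicAdd(counter, 1); }

    int main() {
        int *counter, result;
        cudaMalloc(&counter, sizeof(int));

        cudaMemset(counter, 0, sizeof(int));
        racy_increment<<<256, 256>>>(counter);                    // 65,536 increments
        cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("non-atomic: %d (expected 65536)\n", result);      // typically far less

        cudaMemset(counter, 0, sizeof(int));
        atomic_increment<<<256, 256>>>(counter);
        cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
        printf("atomic:     %d (expected 65536)\n", result);      // always 65536

        cudaFree(counter);
        return 0;
    }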

4

Key Concepts of Atomic Operations
– A read-modify-write operation performed by a single hardware instruction on a memory location (address)
– Read the old value, calculate a new value, and write the new value to the location
– The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
– Any other threads that attempt to perform an atomic operation on the same location will typically be held in a queue
– All threads perform their atomic operations on the same location serially

5

Atomic Operations in CUDA
– Performed by calling functions that are translated into single instructions (a.k.a. intrinsic functions or intrinsics)
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare and swap)
– Read the CUDA C Programming Guide 4.0 or later for details
– Atomic Add
  int atomicAdd(int* address, int val);
– Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
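Because the old value is returned, atomicAdd is also commonly used to hand each thread a unique output position; a small sketch (the enqueue, queue, and count names are illustrative, not from the slides):

    // Each calling thread atomically bumps *count; the returned old value is a
    // slot index that no other thread can receive.
    __device__ void enqueue(int *queue, int *count, int value) {
        int slot = atomicAdd(count, 1);   // old value of *count, unique per thread
        queue[slot] = value;
    }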

6

More Atomic Adds in CUDA
– Unsigned 32-bit integer atomic add
  unsigned int atomicAdd(unsigned int* address, unsigned int val);
– Unsigned 64-bit integer atomic add
  unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
– Single-precision floating-point atomic add (compute capability >= 2.0)
  float atomicAdd(float* address, float val);
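For types without a native atomic add (for example, double precision on the compute capabilities assumed above), the read-modify-write can be synthesized from atomicCAS; a sketch following the well-known retry-loop pattern from the CUDA C Programming Guide:

    // Emulated double-precision atomic add built on the 64-bit atomicCAS intrinsic.
    // The loop retries whenever another thread changes the location between our
    // read (assumed) and our compare-and-swap.
    __device__ double atomicAddDouble(double *address, double val) {
        unsigned long long int *address_as_ull = (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);   // like atomicAdd, return the old value
    }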

7

A Basic Text Histogram Kernel
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        atomicAdd(&(histo[buffer[i]]), 1);
        i += stride;
    }
}

8

A Basic Histogram Kernel (cont.)
– The kernel receives a pointer to the input buffer of byte values
– Each thread processes the input in a strided pattern

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // stride is total number of threads
    int stride = blockDim.x * gridDim.x;
    // All threads handle blockDim.x * gridDim.x consecutive elements
    while (i < size) {
        int alphabet_position = buffer[i] - 'a';
        if (alphabet_position >= 0 && alphabet_position < 26)
            atomicAdd(&(histo[alphabet_position / 4]), 1);
        i += stride;
    }
}
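For completeness, a host-side sketch of how such a kernel might be allocated and launched. The 7-bin output follows the a-d, e-h, … binning of the earlier example; the variable names, 256-thread blocks, and 120-block grid are assumptions, not part of the slides:

    #include <cuda_runtime.h>

    void run_histogram(const unsigned char *h_buffer, long size, unsigned int h_histo[7]) {
        unsigned char *d_buffer;
        unsigned int  *d_histo;
        cudaMalloc(&d_buffer, size * sizeof(unsigned char));
        cudaMalloc(&d_histo, 7 * sizeof(unsigned int));

        cudaMemcpy(d_buffer, h_buffer, size * sizeof(unsigned char), cudaMemcpyHostToDevice);
        cudaMemset(d_histo, 0, 7 * sizeof(unsigned int));

        // Each thread strides through the input, so the grid does not need to cover
        // the whole buffer; it only needs enough blocks to keep the GPU busy.
        histo_kernel<<<120, 256>>>(d_buffer, size, d_histo);

        cudaMemcpy(h_histo, d_histo, 7 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
        cudaFree(d_buffer);
        cudaFree(d_histo);
    }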

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Atomic Operation Performance
Module 7.4 – Parallel Computation Patterns (Histogram)

2

Objective
– To learn about the main performance considerations of atomic operations
– Latency and throughput of atomic operations
– Atomic operations on global memory
– Atomic operations on the shared L2 cache
– Atomic operations on shared memory

3

Atomic Operations on Global Memory (DRAM)
– An atomic operation on a DRAM location starts with a read, which has a latency of a few hundred cycles
– The atomic operation ends with a write to the same location, with a latency of a few hundred cycles
– During this whole time, no other thread can access the location

4

Atomic Operations on DRAM
– Each read-modify-write has two full memory access delays
– All atomic operations on the same variable (DRAM location) are serialized
– [Timeline figure: atomic operation N incurs a DRAM read latency followed by a DRAM write latency; atomic operation N+1 can only start afterwards and incurs the same two latencies]

5

Latency determines throughput
– Throughput of atomic operations on the same DRAM location is the rate at which the application can execute an atomic operation
– The rate for atomic operations on a particular location is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations
– This means that if many threads attempt to do atomic operations on the same location (contention), the memory throughput is reduced to less than 1/1000 of the peak bandwidth of one memory channel
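As a rough, illustrative calculation (the 1 GHz clock is an assumption, not a figure from the slides): with a ~1,000-cycle read-modify-write latency on a 1 GHz interface, a single contended location sustains at most 10^9 / 1,000 = 10^6 atomic operations per second, i.e. about 4 MB/s of 4-byte updates, while a single memory channel of that generation delivers on the order of tens of GB/s, well under 1/1000 of peak, consistent with the figure above.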

6

You may have a similar experience at supermarket checkout
– Some customers realize that they missed an item after they started to check out
– They run to the aisle and get the item while the line waits
– The rate of checkout is drastically reduced due to the long latency of running to the aisle and back
– Imagine a store where every customer starts to check out before they even fetch any of the items
– The rate of checkout will be 1 / (entire shopping time of each customer)

7

Hardware Improvements
– Atomic operations on the Fermi L2 cache
– Medium latency, about 1/10 of the DRAM latency
– Shared among all blocks
– “Free improvement” on global memory atomics
– [Timeline figure: with L2 atomics, operation N and operation N+1 each take only an L2 read latency plus an L2 write latency]

8

Hardware Improvements
– Atomic operations on shared memory
– Very short latency
– Private to each thread block
– Needs algorithm work by programmers (more later)
– [Timeline figure: atomic operation N is followed almost immediately by atomic operation N+1]

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved Throughput
Module 7.5 – Parallel Computation Patterns (Histogram)

2

Objective
– Learn to write a high-performance kernel by privatizing outputs
– Privatization as a technique for reducing latency, increasing throughput, and reducing serialization
– A high-performance privatized histogram kernel
– Practical example of using shared memory and L2 cache atomic operations

3

Privatization
– [Figure: in one scheme, Block 0, Block 1, …, Block N all perform atomic updates directly on a single final copy; in the privatized scheme, each block performs atomic updates on its own copy (Copy 0, Copy 1, …, Copy N), which are later merged into the final copy]
– Heavy contention and serialization when all blocks update the single final copy

4

Privatization (cont.)
– [Same figure: the atomic updates go to the per-block private copies (Copy 0, Copy 1, …, Copy N), which are then merged into the final copy]
– Much less contention and serialization

5

Privatization (cont.)
– [Figure repeated from the previous slide: per-block private copies absorb the atomic updates and are combined into the final copy]
– Much less contention and serialization

6

Cost and Benefit of Privatization
– Cost
– Overhead for creating and initializing private copies
– Overhead for accumulating the contents of private copies into the final copy
– Benefit
– Much less contention and serialization in accessing both the private copies and the final copy
– The overall performance can often be improved by more than 10x

7

Shared Memory Atomics for Histogram
– Each subset of threads is in the same block
– Much higher throughput than DRAM (100x) or L2 (10x) atomics
– Less contention – only threads in the same block can access a shared memory variable
– This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

9

Shared Memory Atomics Requires Privatization
– Create private copies of the histo[] array for each thread block
– Initialize the bin counters in the private copies of histo[]

__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
{
    __shared__ unsigned int histo_private[7];

    // Initialize the bin counters in the private copy of histo[]
    if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
    __syncthreads();

10

Build Private Histogram

    // stride is total number of threads
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&(histo_private[buffer[i] / 4]), 1);
        i += stride;
    }

11

Build Final Histogram

    // wait for all other threads in the block to finish
    __syncthreads();

    if (threadIdx.x < 7)
        atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
}
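Assembled from the fragments on the preceding slides, the complete privatized kernel reads roughly as follows. This is a sketch: it assumes 7 bins and at least 7 threads per block, as the slides do, and it carries over the a-z range test from the Module 7.3 kernel so that out-of-range bytes cannot index past the 7-entry private array:

    __global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo)
    {
        // Per-block private copy of the 7-bin histogram in shared memory
        __shared__ unsigned int histo_private[7];
        if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
        __syncthreads();

        // Interleaved (coalesced) traversal of the input
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int alphabet_position = buffer[i] - 'a';
            if (alphabet_position >= 0 && alphabet_position < 26)
                atomicAdd(&(histo_private[alphabet_position / 4]), 1);
            i += stride;
        }

        // Wait for the whole block, then merge the private copy into the global histogram
        __syncthreads();
        if (threadIdx.x < 7)
            atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]);
    }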

12

More on Privatization
– Privatization is a powerful and frequently used technique for parallelizing applications
– The operation needs to be associative and commutative
– The histogram add operation is associative and commutative
– No privatization if the operation does not fit the requirement
– The private histogram size needs to be small
– Fits into shared memory
– What if the histogram is too large to privatize?
– Sometimes one can partially privatize an output histogram and use range testing to go to either global memory or shared memory (a sketch follows below)
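To make the last point concrete, a purely illustrative sketch of partial privatization with range testing (none of these names come from the slides): only the first PRIV_BINS bins live in shared memory, and any bin outside that range falls back to a global-memory atomic. The input is assumed to already be an array of bin indices.

    #define PRIV_BINS 1024   // assumed size of the privatized range (4 KB of shared memory)

    __global__ void histo_partial(const int *bin_idx, long size, unsigned int *histo)
    {
        __shared__ unsigned int histo_s[PRIV_BINS];
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            histo_s[b] = 0;
        __syncthreads();

        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        while (i < size) {
            int bin = bin_idx[i];
            if (bin < PRIV_BINS)
                atomicAdd(&histo_s[bin], 1);   // range test passed: cheap shared memory atomic
            else
                atomicAdd(&histo[bin], 1);     // large bin index: global memory atomic
            i += stride;
        }

        // Merge only the privatized range back into the global histogram
        __syncthreads();
        for (int b = threadIdx.x; b < PRIV_BINS; b += blockDim.x)
            atomicAdd(&histo[b], histo_s[b]);
    }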

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Page 86: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

Accelerated Computing

GPU Teaching Kit

Atomic Operations in CUDAModule 73 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 87: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

2

Objectivendash To learn to use atomic operations in parallel programming

ndash Atomic operation conceptsndash Types of atomic operations in CUDAndash Intrinsic functionsndash A basic histogram kernel

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 88: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

3

Data Race Without Atomic Operations

ndash Both threads receive 0 in Oldndash Mem[x] becomes 1

thread1

thread2 Old Mem[x]

New Old + 1

Mem[x] New

Old Mem[x]

New Old + 1

Mem[x] New

Mem[x] initialized to 0

time

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 89: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

4

Key Concepts of Atomic Operationsndash A read-modify-write operation performed by a single hardware

instruction on a memory location addressndash Read the old value calculate a new value and write the new value to the

locationndash The hardware ensures that no other threads can perform another

read-modify-write operation on the same location until the current atomic operation is completendash Any other threads that attempt to perform an atomic operation on the

same location will typically be held in a queuendash All threads perform their atomic operations serially on the same location

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Page 90: GPU Teaching Kit - Purdue Universitysmidkiff/ece563/NVidiaGPUTeachin… · GPU Teaching Kit Histogramming. Module 7.1 – Parallel Computation Patterns (Histogram) 2. Objective –

5

Atomic Operations in CUDAndash Performed by calling functions that are translated into single instructions

(aka intrinsic functions or intrinsics)ndash Atomic add sub inc dec min max exch (exchange) CAS (compare

and swap)ndash Read CUDA C programming Guide 40 or later for details

ndash Atomic Addint atomicAdd(int address int val)

ndash reads the 32-bit word old from the location pointed to by address in global or shared memory computes (old + val) and stores the result back to memory at the same address The function returns old

6

More Atomic Adds in CUDAndash Unsigned 32-bit integer atomic add

unsigned int atomicAdd(unsigned int addressunsigned int val)

ndash Unsigned 64-bit integer atomic addunsigned long long int atomicAdd(unsigned long long

int address unsigned long long int val)

ndash Single-precision floating-point atomic add (capability gt 20)ndash float atomicAdd(float address float val)

6

7

A Basic Text Histogram Kernelndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

atomicAdd( amp(histo[buffer[i]]) 1)i += stride

8

A Basic Histogram Kernel (cont)ndash The kernel receives a pointer to the input buffer of byte valuesndash Each thread process the input in a strided pattern

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

int i = threadIdxx + blockIdxx blockDimx

stride is total number of threadsint stride = blockDimx gridDimx

All threads handle blockDimx gridDimx consecutive elementswhile (i lt size)

int alphabet_position = buffer[i] ndash ldquoardquoif (alphabet_position gt= 0 ampamp alpha_position lt 26) atomicAdd(amp(histo[alphabet_position4]) 1)i += stride

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Atomic Operation PerformanceModule 74 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash To learn about the main performance considerations of atomic

operationsndash Latency and throughput of atomic operationsndash Atomic operations on global memoryndash Atomic operations on shared L2 cachendash Atomic operations on shared memory

2

3

Atomic Operations on Global Memory (DRAM)ndash An atomic operation on a

DRAM location starts with a read which has a latency of a few hundred cycles

ndash The atomic operation ends with a write to the same location with a latency of a few hundred cycles

ndash During this whole time no one else can access the location

4

Atomic Operations on DRAMndash Each Read-Modify-Write has two full memory access delays

ndash All atomic operations on the same variable (DRAM location) are serialized

atomic operation N atomic operation N+1

timeDRAM read latency DRAM read latencyDRAM write latency DRAM write latency

5

Latency determines throughputndash Throughput of atomic operations on the same DRAM location is the

rate at which the application can execute an atomic operation

ndash The rate for atomic operation on a particular location is limited by the total latency of the read-modify-write sequence typically more than 1000 cycles for global memory (DRAM) locations

ndash This means that if many threads attempt to do atomic operation on the same location (contention) the memory throughput is reduced to lt 11000 of the peak bandwidth of one memory channel

5

6

You may have a similar experience in supermarket checkout

ndash Some customers realize that they missed an item after they started to check out

ndash They run to the isle and get the item while the line waitsndash The rate of checkout is drasticaly reduced due to the long latency of running to the

isle and backndash Imagine a store where every customer starts the check out before

they even fetch any of the itemsndash The rate of the checkout will be 1 (entire shopping time of each customer)

6

7

Hardware Improvements ndash Atomic operations on Fermi L2 cache

ndash Medium latency about 110 of the DRAM latencyndash Shared among all blocksndash ldquoFree improvementrdquo on Global Memory atomics

atomic operation N atomic operation N+1

time

L2 latency L2 latency L2 latency L2 latency

8

Hardware Improvementsndash Atomic operations on Shared Memory

ndash Very short latencyndash Private to each thread blockndash Need algorithm work by programmers (more later)

atomic operation

N

atomic operation

N+1

time

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License

Accelerated Computing

GPU Teaching Kit

Privatization Technique for Improved ThroughputModule 75 ndash Parallel Computation Patterns (Histogram)

2

Objectivendash Learn to write a high performance kernel by privatizing outputs

ndash Privatization as a technique for reducing latency increasing throughput and reducing serialization

ndash A high performance privatized histogram kernel ndash Practical example of using shared memory and L2 cache atomic

operations

2

3

Privatization

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Heavy contention and serialization

4

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

5

Privatization (cont)

Final Copy

hellipBlock 0 Block 1 Block N

Atomic Updates

Copy 0 Copy 1

Final Copy

Copy Nhellip

Block 0 Block 1 Block Nhellip

Much less contention and serialization

6

Cost and Benefit of Privatizationndash Cost

ndash Overhead for creating and initializing private copiesndash Overhead for accumulating the contents of private copies into the final copy

ndash Benefitndash Much less contention and serialization in accessing both the private copies and the

final copyndash The overall performance can often be improved more than 10x

7

Shared Memory Atomics for Histogram ndash Each subset of threads are in the same blockndash Much higher throughput than DRAM (100x) or L2 (10x) atomicsndash Less contention ndash only threads in the same block can access a

shared memory variablendash This is a very important use case for shared memory

8

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

8

9

Shared Memory Atomics Requires Privatization

ndash Create private copies of the histo[] array for each thread block

__global__ void histo_kernel(unsigned char bufferlong size unsigned int histo)

__shared__ unsigned int histo_private[7]

if (threadIdxx lt 7) histo_private[threadidxx] = 0__syncthreads()

9

Initialize the bin counters in the private copies of histo[]

10

Build Private Histogram

int i = threadIdxx + blockIdxx blockDimx stride is total number of threads

int stride = blockDimx gridDimxwhile (i lt size)

atomicAdd( amp(private_histo[buffer[i]4) 1)i += stride

10

11

Build Final Histogram wait for all other threads in the block to finish__syncthreads()

if (threadIdxx lt 7) atomicAdd(amp(histo[threadIdxx]) private_histo[threadIdxx] )

11

12

More on Privatizationndash Privatization is a powerful and frequently used technique for

parallelizing applications

ndash The operation needs to be associative and commutativendash Histogram add operation is associative and commutativendash No privatization if the operation does not fit the requirement

ndash The private histogram size needs to be smallndash Fits into shared memory

ndash What if the histogram is too large to privatizendash Sometimes one can partially privatize an output histogram and use range

testing to go to either global memory or shared memory

12

Accelerated Computing

GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 40 International License
