
SHOC: Overview and Kernel Walkthrough

Kyle Spafford
Keeneland Tutorial
April 14, 2011

The Scalable Heterogeneous Computing Benchmark Suite (SHOC)

• Focuses on scientific computing workloads, including common kernels like SGEMM, FFT, and stencils
• Parallelized with MPI, with support for multi-GPU and cluster-scale comparisons
• Implemented in both CUDA and OpenCL for a 1:1 comparison
• Includes system and stability tests

SHOC Results Browser (beta): http://ft.ornl.gov/~kspafford/shoctown/Shoctown.html

Download SHOC

• Source code: http://ft.ornl.gov/doku/shoc/downloads
• Build and run: http://ft.ornl.gov/doku/shoc/gettingstarted

    sh ./conf/config-keeneland.sh
    make
    cd tools
    perl driver.pl -cuda -s 4

  – Includes example output for Keeneland
• FAQ: http://ft.ornl.gov/doku/shoc/faq

SHOC Categories

• Performance
  – Level 0: speeds and feeds (raw FLOPS rates, bandwidths, latencies)
  – Level 1: algorithms (FFT, matrix multiply, stencil, sort, etc.)
  – Level 2: application kernels (S3D chemistry, molecular dynamics)
• System: PCIe contention, MPI latency vs. host-device bandwidth, NUMA
• Stability: FFT-based error detection

(Level 0 Example): DeviceMemory

• Motivation
  – Determine sustainable device memory bandwidth
  – Benchmark local, global, and image memory
• Basic design
  – Test different memory access patterns, i.e. coalesced vs. uncoalesced
  – Measure both read and write bandwidth
  – Vary the number of threads in a block

[Figure: memory access patterns for threads 1-4: coalesced vs. thread-sequential (uncoalesced).]

SHOC: Level 0 Tests

• BusSpeedDownload/Readback: measures bandwidth/latency of the PCIe bus
• DeviceMemory: measures global/constant/shared memory bandwidth
• KernelCompilation: measures OpenCL JIT kernel compilation speeds
• MaxFlops: measures achievable FLOPS (synthetic, not bandwidth-bound)
• QueueDelay: measures OpenCL queueing system overhead

(Level 1 Example): Stencil2D

• Motivation
  – Supports investigation of accelerator usage within a parallel application context
  – Serial and True Parallel versions
• Basic design
  – 9-point stencil operation applied to a 2D data set (see the sketch below)
  – MPI version uses a 2D Cartesian data distribution, with periodic halo exchanges
  – Applies the stencil to data in local memory
• OpenCL/CUDA observations
  – Runtime dominated by data movement
    • Between host and card
    • Between MPI processes
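As a rough illustration of the core computation (a minimal sketch, not SHOC's actual kernel; the name doStencil and the weight parameters are hypothetical):

    // 9-point stencil: out(i,j) is a weighted sum of the point itself,
    // its 4 edge neighbors, and its 4 corner neighbors. Halo rows and
    // columns are assumed to have been exchanged over MPI beforehand.
    __global__ void doStencil(const float *in, float *out,
                              int nRows, int nCols,
                              float wCenter, float wAdjacent, float wDiagonal)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
        if (i < 1 || i >= nRows - 1 || j < 1 || j >= nCols - 1)
            return;  // leave the halo border untouched

        int idx = i * nCols + j;
        out[idx] = wCenter   *  in[idx]
                 + wAdjacent * (in[idx - 1] + in[idx + 1] +
                                in[idx - nCols] + in[idx + nCols])
                 + wDiagonal * (in[idx - nCols - 1] + in[idx - nCols + 1] +
                                in[idx + nCols - 1] + in[idx + nCols + 1]);
    }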

SHOC: Level 1 Tests

• FFT
• Reduction
• Scan
• SGEMM
• Sort
• SpMV
• Stencil2D
• Triad

(Level 2 Example): S3D

• Motivation
  – Measure performance of an important DOE application
  – S3D solves the Navier-Stokes equations on a regular 3D domain, used to simulate combustion
• Basic design
  – Assign each grid point to a device thread
  – Highly parallel, as grid points are independent
• OpenCL/CUDA observations
  – CUDA outperforms OpenCL
    • Big factor: native transcendentals (sin, cos, tan, etc.)

[Figure: 3D regular domain decomposition. Each thread handles a grid point; blocks handle regions.]

SHOC: Other Tests

• Stability
  – FFTs are sensitive to small errors
  – Repeated simultaneous FFT/iFFT
  – Parallel, for testing large systems
• System
  – MPI contention: impact of GPU usage on MPI latency
  – Chipset contention: impact of MPI communication on GPU performance
  – NUMA: multi-socket, multiple PCIe slots, multiple RAM banks

Compare OpenCL and CUDA

• OpenCL improving, but still trailing CUDA
• Tesla C2050, CUDA/OpenCL 3.2 RC2

[Bar chart: CUDA performance relative to OpenCL across FFT, FFT DP, MD, MD DP, SGEMM, DGEMM, S3D, Reduction, Scan, Sort, SpMV-Vector DP, and SpMV-ELLPACKR DP; CUDA's speedup ranges from roughly 1.0x to 6.4x depending on the benchmark.]

Example Results

SP Sparse Matrix-Vector Multiplication (GB/s):

  ATI Radeon HD5870    4.52
  NV GTX580           11.99
  NV GTX480            7.8
  Tesla M2070          6.24
  NV Ion               0.24


Reduction Walkthrough

• Fundamental kernel in almost all programs

• Easy to implement, but hard to get right

• We’ll walk through the optimization process.

• Code for these kernels is at http://ft.ornl.gov/~kspafford/tutorial.tgz

• Graphics from a similar presentation by Mark Harris, NVIDIA

Reduction Walkthrough

• Start with the well-known, tree-based approach:

[Figure: tree-based reduction, combining pairs of elements at each level until a single value remains.]


Algorithm Sketch

• Launch 1 thread per element

• Each thread loads an element from global memory into shared memory

• Each block reduces its shared memory into 1 value

• This value is written back out to global memory

Algorithm Sketch with Code

• Main steps
  – Each thread loads a value from global memory into shared memory:

        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];

  – Synchronize threads:

        __syncthreads();

  – Reduce shared memory into a single value
  – Write the value out to global memory

Reduction of Shared Memory

    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
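Putting the pieces together gives the complete version-0 kernel (a sketch following Mark Harris's reduce0, with a bounds guard added so n need not be a multiple of the block size):

    // Reduction v0: interleaved addressing, divergent warps.
    // Each block writes one partial sum to g_odata.
    __global__ void reduce0(const float *g_idata, float *g_odata, unsigned int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? g_idata[i] : 0.0f;  // guarded load
        __syncthreads();

        // Tree-based reduction in shared memory
        for (int s = 1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // Thread 0 writes this block's partial sum
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }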

Reduction v0

• Problem 1: divergent warps
• Problem 2: the modulo operator is expensive on GPUs

Recursive Invocation

• Problem: we want a single value, but blocks can't communicate
• Solution: recursive kernel invocation (see the sketch below)
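On the host, this amounts to re-launching the kernel on the partial sums until one value remains. A minimal sketch (assuming device buffers are already allocated, with d_out large enough for one partial sum per block; not SHOC's actual driver code):

    // Repeatedly reduce until a single value remains. Buffers are
    // ping-ponged, so the original input is overwritten; the final
    // result ends up in d_in[0] after the loop.
    void reduceAll(float *d_in, float *d_out, unsigned int n, int blockSize)
    {
        while (n > 1) {
            int blocks = (n + blockSize - 1) / blockSize;
            reduce0<<<blocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out, n);
            float *tmp = d_in; d_in = d_out; d_out = tmp;  // swap roles
            n = blocks;  // partial sums are next pass's input
        }
    }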

Performance

  Kernel Version                                   GB/s   Step Speedup   Total Speedup
  0 – interleaved addressing, divergent warps       6.9       -              -

Reduction v1

• Get rid of the divergent branch and modulo operator
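The replacement inner loop, per the Harris reduction notes (a computed index keeps the active threads packed into the same warps, but introduces shared memory bank conflicts, which is the next problem):

    // Reduction v1: no divergent branch, no modulo.
    for (int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }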


Problem – Bank Conflicts

• Shared memory is composed of 32 banks
• When multiple threads access *different* words in the *same* bank, access is serialized

Performance

  Kernel Version                                   GB/s   Step Speedup   Total Speedup
  0 – interleaved addressing, divergent warps       6.9       -              -
  1 – interleaved addressing, bank conflicts       10.9     1.58x          1.58x

Reduction v2 – Sequential Addressing

• Replace interleaved addressing with sequential addressing: start with a stride of half the block size and halve it each step, so the active threads stay contiguous and conflict-free (see the sketch below)
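The sequentially addressed loop (again per the Harris notes):

    // Reduction v2: sequential addressing, no bank conflicts.
    // The stride halves each iteration; active threads stay contiguous.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }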

Performance

  Kernel Version                                   GB/s   Step Speedup   Total Speedup
  0 – interleaved addressing, divergent warps       6.9       -              -
  1 – interleaved addressing, bank conflicts       10.9     1.58x          1.58x
  2 – removed bank conflicts                       14.0     1.28x          2.03x

Reduction v3 – Unrolling the Last Warp

• We know threads execute in a warp-synchronous fashion
• For the last few steps, we can get rid of extra __syncthreads() calls
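A sketch of the unrolled tail (per the Harris notes; this relies on the warp-synchronous execution of the GPUs targeted here, and the shared memory pointer must be volatile so the compiler doesn't cache values in registers):

    // Reduction v3: run the v2 loop only while s > 32, then unroll
    // the last warp with no __syncthreads() (assumes blockDim.x >= 64).
    if (tid < 32) {
        volatile float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid +  8];
        smem[tid] += smem[tid +  4];
        smem[tid] += smem[tid +  2];
        smem[tid] += smem[tid +  1];
    }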

Performance

  Kernel Version                                   GB/s   Step Speedup   Total Speedup
  0 – interleaved addressing, divergent warps       6.9       -              -
  1 – interleaved addressing, bank conflicts       10.9     1.58x          1.58x
  2 – removed bank conflicts                       14.0     1.28x          2.03x
  3 – unrolled last warp                           23.1     1.65x          3.34x

Reduction v4 – Multiple Elements Per Thread

• Still have some instruction overhead
  – Can use templates to totally unroll the loop
  – Can have threads handle multiple elements from global memory (see the sketch below)
• Bonus: reduces any array size to 2 kernel invocations
• This is a useful optimization for most kernels
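A sketch of the multi-element load (following the grid-stride formulation in the Harris notes; for brevity it assumes n is a multiple of 2*blockDim.x):

    // Reduction v4: each thread sums many elements while loading,
    // so far fewer blocks are needed and loop overhead amortizes.
    unsigned int i = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;
    float sum = 0.0f;
    while (i < n) {
        sum += g_idata[i] + g_idata[i + blockDim.x];
        i += gridSize;
    }
    sdata[tid] = sum;   // then reduce sdata as in v3
    __syncthreads();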

Performance

  Kernel Version                                   GB/s   Step Speedup   Total Speedup
  0 – interleaved addressing, divergent warps       6.9       -              -
  1 – interleaved addressing, bank conflicts       10.9     1.58x          1.58x
  2 – removed bank conflicts                       14.0     1.28x          2.03x
  3 – unrolled last warp                           23.1     1.65x          3.34x
  4 – totally unrolled, multiple elems per thread  36.5     1.58x          5.29x


More about Reduction

• http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

• Demo

wget http://ft.ornl.gov/~kspafford/tutorial.tgz

Programming Problem - Scan

• Now let's think about how this extends to scan (a.k.a. prefix sum)

Scan takes a binary associative operator ⊕ and an array of n elements:

  [a0, a1, …, an-1],

and returns the array

  [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)].

Example: if ⊕ is addition:

  [3 1 7 0 4] → [3 4 11 11 15]
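For reference, the same inclusive scan computed sequentially on the host (useful as a correctness check):

    // Sequential inclusive scan with + as the operator.
    void scanCPU(const float *in, float *out, int n)
    {
        float running = 0.0f;
        for (int i = 0; i < n; i++) {
            running += in[i];   // a0 + a1 + ... + ai
            out[i] = running;
        }
    }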


Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8


Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8

Kernel 2: Exclusive Top-level scan

0 10 23 29


Reduce-then-scan Strategy

7 3 8 5 5 1 2 6

Kernel 1: Reduce

10 13 6 8

Kernel 2: Exclusive Top-level scan

0 10 23 29

Kernel 3: Bottom-level scan

7 10 18 23 28 29 31 37
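On the host, the strategy is three kernel launches (a sketch; the names reduceBlocks, scanTopLevel, and scanBottomLevel are hypothetical stand-ins, not SHOC's actual entry points, and buffers/launch parameters are assumed set up elsewhere):

    // 1) Each block reduces its chunk of d_in to one partial sum.
    reduceBlocks<<<numBlocks, blockSize, smemBytes>>>(d_in, d_blockSums, n);
    // 2) A single block exclusively scans the per-block sums.
    scanTopLevel<<<1, numBlocks, smemBytes>>>(d_blockSums, numBlocks);
    // 3) Each block scans its chunk, offset by its scanned block sum.
    scanBottomLevel<<<numBlocks, blockSize, smemBytes>>>(d_in, d_out, d_blockSums, n);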

Fast Scan Kernel

• Use twice as much shared memory as there are elements; set the first half to 0 and the second half to the input.

  0 0 0 0 | 10 13 6 8
  0 0 0 0 | 10 23 19 14
  0 0 0 0 | 10 23 29 37

    // idx indexes the second half of smem (idx = tid + blockSize)
    for (int d = 1; d < blockSize; d *= 2) {
        float t = smem[idx - d];   // zero padding makes idx - d safe
        __syncthreads();
        smem[idx] += t;
        __syncthreads();
    }
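Assembled into a self-contained single-block kernel (a sketch of the padded approach above; SHOC's scanLocalMem is a faster warp-level variant):

    // Inclusive scan of blockDim.x floats using 2x padded shared memory.
    // Launch as: scanBlock<<<1, B, 2 * B * sizeof(float)>>>(d_in, d_out);
    __global__ void scanBlock(const float *g_in, float *g_out)
    {
        extern __shared__ float smem[];        // 2 * blockDim.x floats
        unsigned int tid = threadIdx.x;
        unsigned int idx = tid + blockDim.x;   // index into second half

        smem[tid] = 0.0f;        // first half: zero padding
        smem[idx] = g_in[tid];   // second half: input
        __syncthreads();

        for (unsigned int d = 1; d < blockDim.x; d *= 2) {
            float t = smem[idx - d];  // padding makes idx - d safe
            __syncthreads();
            smem[idx] += t;
            __syncthreads();
        }
        g_out[tid] = smem[idx];  // inclusive result
    }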

Example Code

• Kernel found in SHOC (src/level1/scan/scan_kernel.h) in the scanLocalMem function
• You can adapt this function for the top-level exclusive scan and the bottom-level inclusive scans
• Problems:
  – Determine how the reduction should stride across global memory
  – Figure out how to make it exclusive/inclusive (hint: remember the first half of smem is 0)
  – Figure out how to use the scan kernel for the bottom-level scan

Good Luck!

Further reading on Scan:

• Examples in the CUDA SDK

• http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

• http://back40computing.googlecode.com/svn/wiki/documents/ParallelScanForStreamArchitecturesTR.pdf

MD Walkthrough

• Motivation
  – Classic n-body pairwise computation, important to all MD codes such as GPU-LAMMPS, AMBER, NAMD, Gromacs, Charmm
• Basic design
  – Computation of the Lennard-Jones potential force
  – 3D domain, random distribution
  – Neighbor-list algorithm

Algorithm Sketch

    for each atom, i {
        force = 0;
        for each neighbor, j {
            dist = distance(pos[i], pos[j]);
            if (dist < cutoff)
                force += interaction(i, j);
        }
        forces[i] = force;
    }
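In CUDA, one thread per atom maps directly onto the outer loop. A minimal sketch (illustrative only, not the MD.cu implementation; the name ljForce, the neighbor-list layout, and the simplified force expression are assumptions):

    // One thread per atom. neighbors[i * maxNeighbors + k] holds the
    // k-th neighbor of atom i; forces accumulate within the cutoff.
    __global__ void ljForce(const float4 *pos, float4 *forces,
                            const int *neighbors, const int *numNeighbors,
                            int maxNeighbors, int nAtoms, float cutoffSq)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nAtoms) return;

        float4 p = pos[i];
        float3 f = make_float3(0.0f, 0.0f, 0.0f);

        for (int k = 0; k < numNeighbors[i]; k++) {
            int j = neighbors[i * maxNeighbors + k];
            float4 q = pos[j];   // data-dependent, uncoalesced read
            float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 < cutoffSq) {
                float inv2 = 1.0f / r2;
                float inv6 = inv2 * inv2 * inv2;         // (1/r^2)^3
                float s = inv6 * (inv6 - 0.5f) * inv2;   // simplified LJ term
                f.x += dx * s;  f.y += dy * s;  f.z += dz * s;
            }
        }
        forces[i] = make_float4(f.x, f.y, f.z, 0.0f);
    }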

Performance Observations

• Neighbors are data-dependent
  – Results in an uncoalesced read on pos[j]
• Uncoalesced reads kill performance
  – But sometimes the texture cache can help

(Full kernel source: cuda/level1/md/MD.cu)

Performance on Keeneland

[Bar chart: LJ-SP effective bandwidth (GB/s) for nAtom = 12288, 24576, 36864, and 73728; measured bandwidths are 74.47, 77.55, 65.48, and 27.37 GB/s across the four problem sizes.]

For the Hands-On Session

• Scan Competition
• Goal: best performance on a 16 MiB input of floats
  – Use this slide deck and knowledge from the other presentations
  – Test harness with timing and correctness check provided in scan.cu
  – Download it from ft.ornl.gov/~kspafford/tutorial.tgz
• Email submissions (scan.cu) to [email protected]
• I will announce the winner at the end of the hands-on session.


[email protected]