How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU Throughput
Shehzan Mohammed (@shehzanm, @arrayfire)
on-demand.gputechconf.com/gtc/2014/presentations/S4386...


Page 1

How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU Throughput

Shehzan Mohammed (@shehzanm, @arrayfire)

Page 2

Outline

● Introduction to ArrayFire
● Case Study 1: glasses.com
● Case Study 2: Accelerate Diagnostics

Page 3

ArrayFire

● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner

● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development

● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs

Page 4

ArrayFire

● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.

● Support for multiple languages
  ○ C/C++, Fortran, Java, and R

● Linux, Windows, Mac OS X
● OpenGL-based graphics
● Based around one data structure
● Just-in-Time (JIT) compilation
  ○ Combines multiple operations into one kernel
● GFOR, the only data parallel loop

Page 5

ArrayFire Functions

● Hundreds of parallel functions
  ○ Building blocks (non-exhaustive):
    ■ Reductions, scan, sort
    ■ Set operations
    ■ Statistics
    ■ Matrix operations
    ■ Image processing
    ■ Signal processing
    ■ Sparse matrices
    ■ Visualizations

Page 6

Case 1: glasses.com

Page 7

Case 1: Glasses.com

● 3D face reconstruction from images
● Image and coordinate-geometry processing
● Came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ One thread per PC, 8 threads: 30+ seconds
  ○ Developed on OS X

Page 8

Case 1: Glasses.com

● Required a significant hardware investment
  ○ Increased maintenance
  ○ Financially not viable in production
  ○ Had Windows infrastructure

The challenge: Speed, Speed, and much more

Page 9

Challenge 1: Multithreading

● Multithreading benefits CPU code
  ○ Calling CUDA from multiple threads may not offer much benefit
    ■ Overheads of memory management and kernel launches
    ■ Streams, pinned memory, etc. required to harness full potential
  ○ GPU parallelism is faster than the CPU for these operations

● Goal:
  ○ Make host code single threaded
  ○ Move all multithreaded sections to the GPU

Page 10

Challenge 1: Multithreading

● Most multithreaded sections were easily ported
  ○ Images are easy to combine and operate on at once
  ○ ArrayFire gfor
● Some sections were more difficult
  ○ Require serial access and/or complex operations
  ○ Need to run on the host, causing more memory transfers
  ○ Need a combination of OpenMP and ArrayFire
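The gfor construct mentioned here batches all iterations of a loop into one GPU launch. A minimal sketch, assuming ArrayFire's gfor(seq i, n) iterator syntax; the sizes and the operation are illustrative, not from the slides:

```cpp
#include <arrayfire.h>
using namespace af;

// Batch 8 same-sized images as slices of one 3D array and
// process every slice in a single data-parallel loop.
array imgs = randu(512, 512, 8);
array out  = constant(0, 512, 512, 8);

gfor (seq i, 8) {
    // Each "iteration" maps to one slice; all slices run at once.
    out(span, span, i) = imgs(span, span, i) * 2.0f;
}
```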

Page 11

Challenge 2: Multithreading ArrayFire

● ArrayFire was not thread safe
  ○ Designed for GPU performance
  ○ Required substantial work
    ■ Iterative process
  ○ Trade-offs
    ■ Cost of adding critical sections vs.
    ■ Cost of adding multithreading support
  ○ Limiting access to data for each thread

Page 12

Challenge 2: Multithreading ArrayFire

● Required adding critical sections around key operations
● Constant memory and textures
  ○ No way to make these thread safe
    ■ Except with a critical section
  ○ Add a critical section vs. use global memory
    ■ Analyse and customize for each specific operation

Page 13

Challenge 3: Batching

● Image operations can be easily batched
  ○ Most operations work at the pixel or neighborhood level
● Problems arise when operations are more complex
● Batching does not always map cleanly
  ○ e.g. affine matrix multiplications
  ○ Indexing needs to be changed, requiring expensive memcpys

Page 14

Challenge 3: Batching

● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. parallelism on CPU
● Combined OpenMP threads

Page 15

Challenge 3: Batching

● Many small operations
  ○ Individually it didn’t make sense to port them to the GPU
● Increase the dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and strided access
● Moved to single-threaded code
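The 2D -> 3D trick boils down to indexing: store all frames back to back and let one pass (or one gfor) sweep the whole batch instead of looping per frame. A CPU-side sketch of the idea, with hypothetical names:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Scale every pixel of every frame in one flat pass.
// 'frames' holds n frames of size h*w stored contiguously, so
// frame f, pixel (y, x) lives at index f*h*w + y*w + x.
std::vector<float> scale_batch(std::vector<float> frames,
                               std::size_t h, std::size_t w,
                               std::size_t n, float s) {
    for (std::size_t i = 0; i < h * w * n; ++i)  // one loop covers all frames
        frames[i] *= s;
    return frames;
}
```

On the GPU the same layout lets one kernel (or one gfor body) cover every frame, amortizing launch overhead across the batch.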

Page 16
Page 17
Page 18

Challenge 3: Batching

● Call custom CUDA kernels
  ○ Special indexing
● Specialized matrix multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams

float* bound = boundary.device<float>();                 // raw device pointer from the ArrayFire array
kernel<<<blocks, threads>>>(bound, boundary.elements()); // <<<grid, block>>> launch

Page 19

Challenge 3: Batching

● Results
  ○ 90 ms -> 28 ms on a GTX 690

● Other improvements
  ○ Overlapped pinned-memory transfers
  ○ Generic to specialized matrix multiply
  ○ Streams
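The ssyrk-versus-gemm win comes from symmetry: for C = A·Aᵀ the result is symmetric, so only one triangle needs to be computed, roughly halving the flops of a general matrix multiply. A plain C++ sketch of that idea (cuBLAS's ssyrk applies it on the GPU; function name here is hypothetical):

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Compute C = A * A^T for a row-major n x k matrix A.
// Because C is symmetric, only the lower triangle is computed
// and then mirrored, like BLAS syrk -- about half the work of gemm.
std::vector<float> aat(const std::vector<float>& A,
                       std::size_t n, std::size_t k) {
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j) {   // lower triangle only
            float s = 0.0f;
            for (std::size_t t = 0; t < k; ++t)
                s += A[i * k + t] * A[j * k + t];
            C[i * n + j] = C[j * n + i] = s;     // mirror into upper triangle
        }
    return C;
}
```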

Page 20

Concurrent Computation

● Overlap CPU and GPU computation
  ○ CPU handles variable-length data sets one frame at a time
  ○ GPU handles fixed-length data sets, all frames concurrently

#pragma omp sections
{
    #pragma omp section
    {
        // GPU Code
    }
    #pragma omp section
    {
        // CPU Code
    }
}
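The skeleton above can be fleshed out into a runnable sketch; the two work functions are hypothetical stand-ins for the real CPU and GPU workloads. Note that actual concurrency needs #pragma omp parallel sections and compiling with OpenMP enabled; without OpenMP the sections simply run one after the other:

```cpp
#include <cassert>

// Stand-ins for the real workloads: the GPU section would launch
// kernels over the fixed-length data, while the CPU section walks
// the variable-length data one frame at a time.
int gpu_result = 0, cpu_result = 0;
void run_gpu_side() { gpu_result = 1; }   // placeholder for GPU batch work
void run_cpu_side() { cpu_result = 2; }   // placeholder for per-frame CPU work

void process_frame_set() {
    #pragma omp parallel sections
    {
        #pragma omp section
        { run_gpu_side(); }   // one thread drives the GPU...
        #pragma omp section
        { run_cpu_side(); }   // ...while another runs the CPU work
    }
}
```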

Page 21

Results

● 1 process (5 threads): 8 seconds
● 6 processes (2 threads each): 22 seconds
● Demo

Page 23

Case 2: Accelerate Diagnostics

Page 24

Case 2: Accelerate Diagnostics

● Multithreaded Java code with CUDA integration
● Image processing of large images (4k x 4k each)
● Port to C++
● Hard time constraint
● Hard reliability constraint

The challenge: maximize PCIe throughput
  ○ Image processing is very parallel
  ○ Memory transfer is the majority of application run time

Page 25

Case 2: Accelerate Diagnostics

● Target hardware:
  ○ Intel Xeon CPU
  ○ 2 GTX Titans per system
  ○ 64 GB RAM
● Required speedup: ~5x
● Required reliability: 48-hour stress test

Page 26

The Framework

● Master thread: scheduling and management
● Slave threads: each handling one ‘pipeline’
● Each pipeline handles one ‘site’ at a time
● Continuous execution
● Pipeline: the serial flow of execution for one site (the “rabbit hole”)
● Site: an independent data set of images (the “rabbit”)

Page 27

The Framework - Initial

[Diagram: the Master Thread reads sites from the Site Database and feeds Pipe 0 and Pipe 1; each pipe processes one site on its own GPU (GPU 0, GPU 1)]

Page 28

Master Thread

● Minimalist
● Initializes and controls pipelines
● Feeds sites to pipelines

[Diagram: the Master Thread feeding Thread 0/Pipe 0 and Thread 1/Pipe 1, one per GPU, each processing a site]

Page 29

Pipeline

● Serial execution within pipelines
● Processes one site at a time


Page 30

Challenge 1: CPU Parallelism

How to parallelize pipelines independently?
● Each thread processes one pipeline
  ○ At the pipeline level, the application is single threaded
● Allot one GPU to each pipeline
● Pipelines initialized once per run
● Perpetual execution

Page 31

Parallelism: Results

● On the CPU side, this worked fine
● On the GPU, not so much
  ○ Too many blocking syncs to allocate and deallocate memory
  ○ Copy/kernel-execution collisions between threads
  ○ No concurrency
  ○ Extremely slow memory transfer speeds
    ■ Each image is 16 MB, with multiple transfers per kernel call
  ○ Although programmatically parallel, execution was almost serial, probably slower

Page 32

Page 33

Challenge 2A: Pinned Memory

[Diagram: pageable memory copy (host pageable buffer -> pinned staging buffer -> device DRAM) vs. pinned memory copy (host pinned buffer -> device DRAM directly)]

Transfer speeds can double with pinned memory
● For pageable memory, CUDA first transfers to an internal pinned buffer and then to the GPU
● Pinned (non-pageable) memory cannot be paged out by the OS
● With pinned memory, CUDA skips the pageable -> pinned staging transfer
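In CUDA terms the difference is one allocation call. A minimal sketch using the standard CUDA runtime API (error handling omitted; the 16 MB size matches the per-image size mentioned later, the names are hypothetical):

```cpp
#include <cuda_runtime.h>

void upload_one_image() {
    const size_t bytes = 16u << 20;       // one 16 MB image
    float *h_img, *d_img;

    cudaMallocHost(&h_img, bytes);        // pinned: DMA-able, no staging copy
    cudaMalloc(&d_img, bytes);

    // With pinned host memory this transfer goes straight to the
    // device, skipping the internal pageable -> pinned staging step.
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_img);                  // note: pinned alloc/free blocks the
    cudaFree(d_img);                      // system, so do it once at startup/shutdown
}
```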

Page 34

Challenge 2B: Streams

Concurrency, concurrency, concurrency
● Increases PCIe throughput
  ○ Streams allow simultaneous copy and execution
  ○ Together with pinned memory, they allow asynchronous copies
● Each pipeline has one stream allotted to it
  ○ The stream remains active through the lifetime of the pipeline
  ○ All CUDA operations and kernel launches are issued asynchronously on that stream
  ○ Use cudaStreamSynchronize (vs. cudaDeviceSynchronize)
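A sketch of the one-stream-per-pipeline pattern described above, using standard CUDA runtime calls; the function and parameter names are hypothetical:

```cpp
#include <cuda_runtime.h>

// One stream per pipeline, created once and reused for its lifetime.
void pipeline_step(cudaStream_t s, float* d_buf,
                   const float* h_pinned, size_t bytes) {
    // Asynchronous copy requires pinned host memory; it can overlap
    // with copies and kernels queued on the other pipelines' streams.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);

    // kernel<<<grid, block, 0, s>>>(d_buf, ...);  // launch into the same stream

    cudaStreamSynchronize(s);  // wait for *this* pipeline only,
                               // not the whole device
}
```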

Page 35

Pinned Memory and Streams: Results

● Memory transfer speeds increased ~2x
● Problems:
  ○ Allocating/freeing pinned memory is a full system block
    ■ All threads on the CPU and all streams on the GPU are blocked
    ■ Very, very bad
    ■ The benefit of using streams is negated
  ○ Device memory alloc/free is also a blocking sync
  ○ Too many memory API calls negate the benefits of parallelism
  ○ Possible memory leaks, very bad for reliability
    ■ Will be revealed in stress testing

Page 36

Challenge 3: Better Memory Management

Minimize the number of memory allocations and deletions
● On both CPU and GPU
● The memory used in processing each site is deterministic and constant
● Solution: create a memory manager

Page 37

Memory Manager

● Goals:
  ○ Manage host and device memory for each pipeline
  ○ Allocate and free memory
  ○ Assign and retract memory
  ○ Manage transfers between host and device
  ○ Ensure consistency between host and device memory
  ○ Free memory only at the end of the application

Page 38

Memory Manager

[Diagram: the Memory Manager owns pools of device and host memory and hands out “Mirrored Arrays” (records holding a host pointer, a device pointer, type, size, flags, and a stream) through Create(), Push(), Pull(), Update(), Release(), and Free()]

Page 39

Memory Manager

● Memory usage is deterministic; once allocated, memory can be reused as needed
  ○ After the first site run, no new pinned or device memory needs to be created
  ○ Most of the required memory can be created during initialization
  ○ The same chunks of memory can be reused via pointers
  ○ Pointers release the memory back to the manager when processing is completed
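Because per-site memory usage is deterministic, the manager can pre-allocate everything up front and just hand buffers out and take them back. A much-simplified, host-only sketch with hypothetical names; the real manager also mirrors device buffers and tracks streams:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// A fixed-size pool: every buffer is allocated at construction,
// handed out on acquire(), returned on release(), and destroyed
// only when the pool goes away -- no alloc/free during processing.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t elems)
        : storage_(count, std::vector<float>(elems)) {
        for (auto& b : storage_) free_.push_back(&b);
    }
    std::vector<float>* acquire() {        // returns nullptr if exhausted
        if (free_.empty()) return nullptr;
        std::vector<float>* b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(std::vector<float>* b) { free_.push_back(b); }
    std::size_t available() const { return free_.size(); }
private:
    std::vector<std::vector<float>> storage_;  // owns all buffers
    std::vector<std::vector<float>*> free_;    // buffers not in use
};
```

A steady available() count after each site run is also a cheap leak check, which is how the manager makes leaks easy to spot in stress testing.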

Page 40

Better Memory Management: Results

● Drastic reduction in alloc/free calls
● Much better parallelism
  ○ Streams are much more concurrent since blocking calls are reduced
  ○ CPU threads do not need to be synced
● Stable across multiple site runs
● Memory leaks are easily discovered
  ○ An increase in usage after the first run indicates a leak
  ○ The manager can verify that all memory is released at the end of each site

Page 41

The Framework

[Diagram: the Master Thread feeding Pipe 0 and Pipe 1; each pipe now owns its own stream (Stream 0/1) and managed memory, and processes one site on its own GPU (GPU 0, GPU 1)]

Page 42

Results

● Significant performance improvements
● Excellent PCIe throughput
● Highly parallel
● GPU kernel execution time is low compared to memcpy times


Page 43

● Increase pipelines to 4
  ○ 2 per GPU
● 4 pipelines good for the CPU
  ○ 4 heavy processing threads
  ○ 1 light master thread
  ○ 4 threads = optimal usage
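Allotting pipelines to GPUs (two per GPU in this layout) can be done once per worker thread at startup, since CUDA's device selection is per-thread. A hedged sketch; the function name is hypothetical:

```cpp
#include <cuda_runtime.h>

// Called once by each pipeline thread at startup. With 4 pipelines
// and 2 GPUs, pipelines 0 and 2 land on GPU 0, pipelines 1 and 3
// on GPU 1; the binding is per-thread, so later stream creation
// and launches from this thread target the chosen device.
void bind_pipeline_to_gpu(int pipeline_id, int num_gpus) {
    cudaSetDevice(pipeline_id % num_gpus);
}
```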

The Framework - Final

[Diagram: the Master Thread feeding four pipes (Pipe 0-3), each with its own stream (Stream 0-3) and managed memory, processing one site each; two pipes per GPU (GPU 0, GPU 1)]

Page 44
Page 45
Page 46
Page 47
Page 48

Results

● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● Optimal GPU usage
● Problems?


Page 49

Results

● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● Optimal GPU usage
● Problem: OVERHEATING!


Page 50

Results

● Problem: OVERHEATING!
● Solution:
  ○ Use software tools to lower GPU clock speeds
  ○ Control fan speeds on the GPU
  ○ Set target power and temperature limits
● No major reduction in performance

Page 51

Case 2: Takeaways

● An application is only as fast as its slowest part
● True multithreading is awesome
  ○ Not easy, but it can be done
● Memory management is crucial to parallelism
● Be ready to tackle any problem
  ○ Overheating? Really?

Page 52

Q & A