How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU Throughput
Shehzan Mohammed (@shehzanm, @arrayfire)
on-demand.gputechconf.com/gtc/2014/presentations/S4386...


Page 1

How to Combine OpenMP, Streams, and ArrayFire for Maximum Multi-GPU Throughput

Shehzan Mohammed (@shehzanm, @arrayfire)

Page 2

Outline

● Introduction to ArrayFire
● Case Study 1: glasses.com
● Case Study 2: Accelerate Diagnostics

Page 3

ArrayFire

● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner

● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development

● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs

Page 4

ArrayFire

● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.

● Support for multiple languages
  ○ C/C++, Fortran, Java, and R

● Linux, Windows, Mac OS X
● OpenGL-based graphics
● Based around one data structure
● Just-in-Time (JIT) compilation
  ○ Combines multiple operations into one kernel
● GFOR, the only data parallel loop

Page 5

ArrayFire Functions

● Hundreds of parallel functions
  ○ Building blocks (non-exhaustive):
    ■ Reductions, scan, sort
    ■ Set operations
    ■ Statistics
    ■ Matrix operations
    ■ Image processing
    ■ Signal processing
    ■ Sparse matrices
    ■ Visualizations

Page 6

Case 1: glasses.com

Page 7

Case 1: Glasses.com

● 3D face reconstruction from images
● Image and coordinate-geometry processing
● Came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ One thread per PC, 8 threads: 30+ seconds
  ○ Developed on OS X

Page 8

Case 1: Glasses.com

● Required a significant hardware investment
  ○ Increased maintenance
  ○ Financially not viable in production
  ○ Had Windows infrastructure

The challenge: Speed, Speed, and much more

Page 9

Challenge 1: Multithreading

● Multithreading benefits CPU code
  ○ Calling CUDA from multiple threads may not offer much benefit
    ■ Overheads of memory management and kernel launches
    ■ Streams, pinned memory, etc. required to harness full potential
  ○ GPU parallelism is faster than the CPU for these operations

● Goal:
  ○ Make host code single threaded
  ○ Move all multithreaded sections to the GPU

Page 10

Challenge 1: Multithreading

● Most multithreaded sections were easily ported
  ○ Images are easy to combine and operate on at once
  ○ ArrayFire gfor
● Some sections were more difficult
  ○ Require serial access and/or complex operations
  ○ Need to run on the host, causing more memory transfers
  ○ Need a combination of OpenMP and ArrayFire
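The gfor construct mentioned here batches all iterations of a loop into one GPU launch. A minimal sketch, assuming ArrayFire's gfor(seq i, n) iterator syntax; the sizes and the operation are illustrative, not from the slides:

```cpp
#include <arrayfire.h>
using namespace af;

// Batch 8 same-sized images as slices of one 3D array and
// process every slice in a single data-parallel loop.
array imgs = randu(512, 512, 8);
array out  = constant(0, 512, 512, 8);

gfor (seq i, 8) {
    // Each "iteration" maps to one slice; all slices run at once.
    out(span, span, i) = imgs(span, span, i) * 2.0f;
}
```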

Page 11

Challenge 2: Multithreading ArrayFire

● ArrayFire was not thread safe
  ○ Designed for GPU performance
  ○ Required substantial work
    ■ Iterative process
  ○ Trade-offs
    ■ Cost of adding critical sections vs.
    ■ Cost of adding multithreading support
  ○ Limiting access to data for each thread

Page 12

Challenge 2: Multithreading ArrayFire

● Required adding critical sections around key operations
● Constant memory and textures
  ○ No way to make these thread safe
    ■ Except with a critical section
  ○ Add a critical section vs. use global memory
    ■ Analyse and customize for each specific operation

Page 13

Challenge 3: Batching

● Image operations can be easily batched
  ○ Most operations work at the pixel or neighborhood level
● Problems arise when operations are more complex
● Batching does not always map cleanly
  ○ e.g. affine matrix multiplications
  ○ Indexing needs to be changed, requiring expensive memcpys

Page 14

Challenge 3: Batching

● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. parallelism on CPU
● Combined OpenMP threads

Page 15

Challenge 3: Batching

● Many small operations
  ○ Individually it didn’t make sense to port them to the GPU
● Increase the dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and strided access
● Moved to single-threaded code
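The 2D -> 3D trick boils down to indexing: store all frames back to back and let one pass (or one gfor) sweep the whole batch instead of looping per frame. A CPU-side sketch of the idea, with hypothetical names:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Scale every pixel of every frame in one flat pass.
// 'frames' holds n frames of size h*w stored contiguously, so
// frame f, pixel (y, x) lives at index f*h*w + y*w + x.
std::vector<float> scale_batch(std::vector<float> frames,
                               std::size_t h, std::size_t w,
                               std::size_t n, float s) {
    for (std::size_t i = 0; i < h * w * n; ++i)  // one loop covers all frames
        frames[i] *= s;
    return frames;
}
```

On the GPU the same layout lets one kernel (or one gfor body) cover every frame, amortizing launch overhead across the batch.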

Page 16
Page 17
Page 18

Challenge 3: Batching

● Call custom CUDA kernels
  ○ Special indexing
● Specialized matrix multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams

float* bound = boundary.device<float>();                 // raw device pointer from the ArrayFire array
kernel<<<blocks, threads>>>(bound, boundary.elements()); // <<<grid, block>>> launch

Page 19

Challenge 3: Batching

● Results
  ○ 90 ms -> 28 ms on a GTX 690

● Other improvements
  ○ Overlapped pinned-memory transfers
  ○ Generic to specialized matrix multiply
  ○ Streams
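The ssyrk-versus-gemm win comes from symmetry: for C = A·Aᵀ the result is symmetric, so only one triangle needs to be computed, roughly halving the flops of a general matrix multiply. A plain C++ sketch of that idea (cuBLAS's ssyrk applies it on the GPU; function name here is hypothetical):

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Compute C = A * A^T for a row-major n x k matrix A.
// Because C is symmetric, only the lower triangle is computed
// and then mirrored, like BLAS syrk -- about half the work of gemm.
std::vector<float> aat(const std::vector<float>& A,
                       std::size_t n, std::size_t k) {
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j) {   // lower triangle only
            float s = 0.0f;
            for (std::size_t t = 0; t < k; ++t)
                s += A[i * k + t] * A[j * k + t];
            C[i * n + j] = C[j * n + i] = s;     // mirror into upper triangle
        }
    return C;
}
```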

Page 20

Concurrent Computation

● Overlap CPU and GPU computation
  ○ CPU handles variable-length data sets one frame at a time
  ○ GPU handles fixed-length data sets, all frames concurrently

#pragma omp sections
{
    #pragma omp section
    {
        // GPU Code
    }
    #pragma omp section
    {
        // CPU Code
    }
}
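The skeleton above can be fleshed out into a runnable sketch; the two work functions are hypothetical stand-ins for the real CPU and GPU workloads. Note that actual concurrency needs #pragma omp parallel sections and compiling with OpenMP enabled; without OpenMP the sections simply run one after the other:

```cpp
#include <cassert>

// Stand-ins for the real workloads: the GPU section would launch
// kernels over the fixed-length data, while the CPU section walks
// the variable-length data one frame at a time.
int gpu_result = 0, cpu_result = 0;
void run_gpu_side() { gpu_result = 1; }   // placeholder for GPU batch work
void run_cpu_side() { cpu_result = 2; }   // placeholder for per-frame CPU work

void process_frame_set() {
    #pragma omp parallel sections
    {
        #pragma omp section
        { run_gpu_side(); }   // one thread drives the GPU...
        #pragma omp section
        { run_cpu_side(); }   // ...while another runs the CPU work
    }
}
```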

Page 21

Results

● 1 process (5 threads): 8 seconds
● 6 processes (2 threads each): 22 seconds
● Demo

Page 23

Case 2: Accelerate Diagnostics

Page 24

Case 2: Accelerate Diagnostics

● Multithreaded Java code with CUDA integration
● Image processing of large images (4k x 4k each)
● Port to C++
● Hard time constraint
● Hard reliability constraint

The challenge: maximize PCIe throughput
  ○ Image processing is very parallel
  ○ Memory transfer is the majority of application run time

Page 25

Case 2: Accelerate Diagnostics

● Target hardware:
  ○ Intel Xeon CPU
  ○ 2 GTX Titans per system
  ○ 64 GB RAM
● Required speedup: ~5x
● Required reliability: 48-hour stress test

Page 26

The Framework

● Master thread: scheduling and management
● Slave threads: each handling one ‘pipeline’
● Each pipeline handles one ‘site’ at a time
● Continuous execution
● Pipeline: the serial flow of execution for one site (the “rabbit hole”)
● Site: an independent data set of images (the “rabbit”)

Page 27

The Framework - Initial

[Diagram: the Master Thread reads sites from the Site Database and feeds Pipe 0 and Pipe 1; each pipe processes one site on its own GPU (GPU 0, GPU 1)]

Page 28

Master Thread

● Minimalist
● Initializes and controls pipelines
● Feeds sites to pipelines

[Diagram: the Master Thread feeding Thread 0/Pipe 0 and Thread 1/Pipe 1, one per GPU, each processing a site]

Page 29

Pipeline

● Serial execution within pipelines
● Processes one site at a time


Page 30

Challenge 1: CPU Parallelism

How to parallelize pipelines independently?
● Each thread processes one pipeline
  ○ At the pipeline level, the application is single threaded
● Allot one GPU to each pipeline
● Pipelines initialized once per run
● Perpetual execution

Page 31

Parallelism: Results

● On the CPU side, this worked fine
● On the GPU, not so much
  ○ Too many blocking syncs to allocate and deallocate memory
  ○ Copy/kernel-execution collisions between threads
  ○ No concurrency
  ○ Extremely slow memory transfer speeds
    ■ Each image is 16 MB, with multiple transfers per kernel call
  ○ Although programmatically parallel, execution was almost serial, probably slower

Page 32

Page 33

Challenge 2A: Pinned Memory

[Diagram: pageable memory copy (host pageable buffer -> pinned staging buffer -> device DRAM) vs. pinned memory copy (host pinned buffer -> device DRAM directly)]

Transfer speeds can double with pinned memory
● For pageable memory, CUDA first transfers to an internal pinned buffer and then to the GPU
● Pinned (non-pageable) memory cannot be paged out by the OS
● With pinned memory, CUDA skips the pageable -> pinned staging transfer
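In CUDA terms the difference is one allocation call. A minimal sketch using the standard CUDA runtime API (error handling omitted; the 16 MB size matches the per-image size mentioned later, the names are hypothetical):

```cpp
#include <cuda_runtime.h>

void upload_one_image() {
    const size_t bytes = 16u << 20;       // one 16 MB image
    float *h_img, *d_img;

    cudaMallocHost(&h_img, bytes);        // pinned: DMA-able, no staging copy
    cudaMalloc(&d_img, bytes);

    // With pinned host memory this transfer goes straight to the
    // device, skipping the internal pageable -> pinned staging step.
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_img);                  // note: pinned alloc/free blocks the
    cudaFree(d_img);                      // system, so do it once at startup/shutdown
}
```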

Page 34

Challenge 2B: Streams

Concurrency, concurrency, concurrency
● Increases PCIe throughput
  ○ Streams allow simultaneous copy and execution
  ○ Together with pinned memory, they allow asynchronous copies
● Each pipeline has one stream allotted to it
  ○ The stream remains active through the lifetime of the pipeline
  ○ All CUDA operations and kernel launches are issued asynchronously on that stream
  ○ Use cudaStreamSynchronize (vs. cudaDeviceSynchronize)
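A sketch of the one-stream-per-pipeline pattern described above, using standard CUDA runtime calls; the function and parameter names are hypothetical:

```cpp
#include <cuda_runtime.h>

// One stream per pipeline, created once and reused for its lifetime.
void pipeline_step(cudaStream_t s, float* d_buf,
                   const float* h_pinned, size_t bytes) {
    // Asynchronous copy requires pinned host memory; it can overlap
    // with copies and kernels queued on the other pipelines' streams.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);

    // kernel<<<grid, block, 0, s>>>(d_buf, ...);  // launch into the same stream

    cudaStreamSynchronize(s);  // wait for *this* pipeline only,
                               // not the whole device
}
```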

Page 35

Pinned Memory and Streams: Results

● Memory transfer speeds increased ~2x
● Problems:
  ○ Allocating/freeing pinned memory is a full system block
    ■ All threads on the CPU and all streams on the GPU are blocked
    ■ Very, very bad
    ■ The benefit of using streams is negated
  ○ Device memory alloc/free is also a blocking sync
  ○ Too many memory API calls negate the benefits of parallelism
  ○ Possible memory leaks, very bad for reliability
    ■ Will be revealed in stress testing

Page 36

Challenge 3: Better Memory Management

Minimize the number of memory allocations and deletions
● On both CPU and GPU
● The memory used in processing each site is deterministic and constant
● Solution: create a memory manager

Page 37

Memory Manager

● Goals:
  ○ Manage host and device memory for each pipeline
  ○ Allocate and free memory
  ○ Assign and retract memory
  ○ Manage transfers between host and device
  ○ Ensure consistency between host and device memory
  ○ Free memory only at the end of the application

Page 38

Memory Manager

[Diagram: the Memory Manager owns pools of device and host memory and hands out “Mirrored Arrays” (records holding a host pointer, a device pointer, type, size, flags, and a stream) through Create(), Push(), Pull(), Update(), Release(), and Free()]

Page 39

Memory Manager

● Memory usage is deterministic; once allocated, memory can be reused as needed
  ○ After the first site run, no new pinned or device memory needs to be created
  ○ Most of the required memory can be created during initialization
  ○ The same chunks of memory can be reused via pointers
  ○ Pointers release the memory back to the manager when processing is completed
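Because per-site memory usage is deterministic, the manager can pre-allocate everything up front and just hand buffers out and take them back. A much-simplified, host-only sketch with hypothetical names; the real manager also mirrors device buffers and tracks streams:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// A fixed-size pool: every buffer is allocated at construction,
// handed out on acquire(), returned on release(), and destroyed
// only when the pool goes away -- no alloc/free during processing.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t elems)
        : storage_(count, std::vector<float>(elems)) {
        for (auto& b : storage_) free_.push_back(&b);
    }
    std::vector<float>* acquire() {        // returns nullptr if exhausted
        if (free_.empty()) return nullptr;
        std::vector<float>* b = free_.back();
        free_.pop_back();
        return b;
    }
    void release(std::vector<float>* b) { free_.push_back(b); }
    std::size_t available() const { return free_.size(); }
private:
    std::vector<std::vector<float>> storage_;  // owns all buffers
    std::vector<std::vector<float>*> free_;    // buffers not in use
};
```

A steady available() count after each site run is also a cheap leak check, which is how the manager makes leaks easy to spot in stress testing.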

Page 40

Better Memory Management: Results

● Drastic reduction in alloc/free calls
● Much better parallelism
  ○ Streams are much more concurrent since blocking calls are reduced
  ○ CPU threads do not need to be synced
● Stable across multiple site runs
● Memory leaks are easily discovered
  ○ An increase in usage after the first run indicates a leak
  ○ The manager can verify that all memory is released at the end of each site

Page 41

The Framework

[Diagram: the Master Thread feeding Pipe 0 and Pipe 1; each pipe now owns its own stream (Stream 0/1) and managed memory, and processes one site on its own GPU (GPU 0, GPU 1)]

Page 42

Results

● Significant performance improvements
● Excellent PCIe throughput
● Highly parallel
● GPU kernel execution time is low compared to memcpy times


Page 43

● Increase pipelines to 4
  ○ 2 per GPU
● 4 pipelines good for the CPU
  ○ 4 heavy processing threads
  ○ 1 light master thread
  ○ 4 threads = optimal usage
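Allotting pipelines to GPUs (two per GPU in this layout) can be done once per worker thread at startup, since CUDA's device selection is per-thread. A hedged sketch; the function name is hypothetical:

```cpp
#include <cuda_runtime.h>

// Called once by each pipeline thread at startup. With 4 pipelines
// and 2 GPUs, pipelines 0 and 2 land on GPU 0, pipelines 1 and 3
// on GPU 1; the binding is per-thread, so later stream creation
// and launches from this thread target the chosen device.
void bind_pipeline_to_gpu(int pipeline_id, int num_gpus) {
    cudaSetDevice(pipeline_id % num_gpus);
}
```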

The Framework - Final

[Diagram: the Master Thread feeding four pipes (Pipe 0-3), each with its own stream (Stream 0-3) and managed memory, processing one site each; two pipes per GPU (GPU 0, GPU 1)]

Page 44
Page 45
Page 46
Page 47
Page 48

Results

● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● Optimal GPU usage
● Problems?


Page 49

Results

● Improvement in times
  ○ Almost 2x better than required
● Stable memory usage
● Optimal GPU usage
● Problem: OVERHEATING!


Page 50

Results

● Problem: OVERHEATING!
● Solution:
  ○ Use software tools to lower GPU clock speeds
  ○ Control fan speeds on the GPU
  ○ Set target power and temperature limits
● No major reduction in performance

Page 51

Case 2: Takeaways

● An application is only as fast as its slowest part
● True multithreading is awesome
  ○ Not easy, but it can be done
● Memory management is crucial to parallelism
● Be ready to tackle any problem
  ○ Overheating? Really?

Page 52

Q & A