Accelerating 3D Facial Modeling Using ArrayFire, OpenCV and …on-demand.gputechconf.com/gtc/2014/presentations/S4426... · 2014-05-21 · Umar Arshad Subject: This session will discuss

Accelerating 3D Facial Modeling using ArrayFire, OpenCV and CUDA

Umar Arshad (@arshad_umar)
ArrayFire (@arrayfire)

ArrayFire

● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner
● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development
● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs

Demo

http://www.youtube.com/watch?v=RlHOhYF_jqM

Problem

● Came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ 8 threads: 30+ seconds
  ○ One process
  ○ Developed on OS X
● Required a significant hardware investment
  ○ Increased maintenance
  ○ Financially not viable in production
  ○ Had Windows infrastructure

Improvements

● OpenCV - ArrayFire interop
● Rendering using GPUs
  ○ Partial CUDA based estimation
  ○ OpenGL based rendering

● Batching Operations
  ○ Combining data into single operation
● Concurrent Processing
  ○ CPU: small variable length data
  ○ GPU: large fixed length data

Moving to ArrayFire

● OpenCV Mat to ArrayFire array
  ○ Row vs. Column Major
  ○ http://blog.accelereyes.com/blog/2012/09/19/image-processing-with-arrayfire-and-opencv/
● Similar Interface
  ○ Allowed for quick porting
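
The row vs. column major point can be illustrated with a minimal sketch in plain C++. OpenCV's Mat stores pixels row by row, while an ArrayFire array is column-major, so a single-channel image has to be transposed in memory when it crosses the boundary. The function name `rowToColumnMajor` is hypothetical and the code uses only the standard library; it is not the interop code from the linked blog post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert a rows x cols single-channel image from row-major order
// (OpenCV Mat layout) to column-major order (ArrayFire array layout).
// Hypothetical helper for illustration only.
std::vector<float> rowToColumnMajor(const std::vector<float>& src,
                                    std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Element (r, c): offset r*cols + c in row-major,
            // offset c*rows + r in column-major.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}
```

An equivalent approach is to hand the untouched row-major buffer to the column-major library with the dimensions swapped and then transpose on the device, which keeps the reordering off the CPU.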

Rendering

● Software rasterization
● Analysis of algorithm
  ○ Did not require an exact render
● ArrayFire based estimate
  ○ Plot points
  ○ Dilate
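
The plot-then-dilate estimate can be sketched as follows, assuming a binary image and a 3x3 square structuring element (the slides do not state the mask size). The names `Image`, `plotPoints` and `dilate3x3` are hypothetical, and the code is plain C++ standing in for the ArrayFire calls used in the real application.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Image = std::vector<unsigned char>;  // row-major, 1 = set, 0 = empty

// Plot each (x, y) point as a single pixel; out-of-bounds points are skipped.
Image plotPoints(const std::vector<std::pair<int, int>>& pts, int width, int height) {
    Image img(static_cast<std::size_t>(width) * height, 0);
    for (const auto& p : pts) {
        if (p.first >= 0 && p.first < width && p.second >= 0 && p.second < height)
            img[static_cast<std::size_t>(p.second) * width + p.first] = 1;
    }
    return img;
}

// Binary dilation with a 3x3 square mask: an output pixel is set
// if any pixel in its 3x3 neighbourhood is set in the input.
Image dilate3x3(const Image& src, int width, int height) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    Image dst(src.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char v = 0;
            for (int dy = -1; dy <= 1 && !v; ++dy) {
                for (int dx = -1; dx <= 1 && !v; ++dx) {
                    int ny = y + dy, nx = x + dx;
                    if (ny >= 0 && ny < height && nx >= 0 && nx < width)
                        v = src[static_cast<std::size_t>(ny) * width + nx];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = v;
        }
    }
    return dst;
}
```

Dilation fills the gaps between neighbouring plotted vertices, so the result approximates the covered area without rasterizing each triangle. That is enough when, as noted above, the algorithm does not need an exact render.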

Rendering

● Moved to OpenGL for some cases
  ○ Makes use of hardware rasterizer
  ○ ArrayFire -> OpenGL interop using CUDA-OpenGL interop
  ○ See ArrayFire GitHub for sample implementation

https://github.com/arrayfire

Batching

● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. Parallelism on CPU
● Combined OpenMP threads

Batching

● Many small operations
  ○ Individually it didn’t make sense to port to the GPU
● Increase dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and Strided Access
● Moved to single threaded code
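
The 2D -> 3D idea can be sketched in plain C++: instead of keeping one small matrix per frame and processing each in its own thread, the matrices are stacked back to back in a single buffer, and one operation walks the whole batch with a fixed stride. The `Batch` type and the `stack` and `sliceSums` functions are hypothetical names for illustration; in the actual application this role is played by a 3D ArrayFire array with GFOR or strided indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A batch of `count` matrices, each rows x cols, stored back to back
// in one contiguous buffer (the third dimension indexes the matrix).
struct Batch {
    std::size_t rows, cols, count;
    std::vector<float> data;
    std::size_t stride() const { return rows * cols; }  // elements per slice
};

// Stack separately allocated 2D matrices into one 3D batch.
Batch stack(const std::vector<std::vector<float>>& mats,
            std::size_t rows, std::size_t cols) {
    Batch b{rows, cols, mats.size(), {}};
    b.data.reserve(b.stride() * b.count);
    for (const auto& m : mats) {
        assert(m.size() == rows * cols);
        b.data.insert(b.data.end(), m.begin(), m.end());
    }
    return b;
}

// One call covers every slice; slice k starts at offset k * stride().
std::vector<float> sliceSums(const Batch& b) {
    std::vector<float> sums(b.count, 0.0f);
    for (std::size_t k = 0; k < b.count; ++k)
        for (std::size_t i = 0; i < b.stride(); ++i)
            sums[k] += b.data[k * b.stride() + i];
    return sums;
}
```

On a GPU the payoff is that a single launch over the stacked buffer has enough work to occupy the device, where each small matrix on its own would not.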

Batching

● Call custom CUDA kernels
  ○ Special indexing
● Specialized Matrix Multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams

float *bound = boundary.device<float>();  // raw device pointer to the ArrayFire array's data
kernel<<< blocks, threads >>>(bound, boundary.elements());  // launch configuration is <<< grid, block >>>
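
Why ssyrk can beat gemm is easier to see in a small sketch. ssyrk is the BLAS symmetric rank-k update; for a product of the form C = A * A^T the result is symmetric, so only one triangle needs to be computed. The slides do not show the operands, so the A * A^T form is an assumption here, and `aatGeneral` and `aatSymmetric` are hypothetical plain C++ stand-ins, not the cuBLAS routines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * A^T for a row-major n x k matrix A, the general (gemm-like) way:
// each of the n*n output entries gets its own dot product.
std::vector<float> aatGeneral(const std::vector<float>& A, std::size_t n, std::size_t k) {
    assert(A.size() == n * k);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                C[i * n + j] += A[i * k + p] * A[j * k + p];
    return C;
}

// Same product the syrk-like way: compute only entries with j <= i
// (n*(n+1)/2 dot products) and mirror them across the diagonal.
// A real ssyrk writes just one triangle; mirroring here is for easy comparison.
std::vector<float> aatSymmetric(const std::vector<float>& A, std::size_t n, std::size_t k) {
    assert(A.size() == n * k);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            float dot = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                dot += A[i * k + p] * A[j * k + p];
            C[i * n + j] = dot;
            C[j * n + i] = dot;
        }
    }
    return C;
}
```

The symmetric version does roughly half the arithmetic for large n, which is consistent with the 2x figure on the slide.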

Batching

● Results
  ○ 90 ms -> 28 ms on a GTX 690
● Other Improvements
  ○ Overlapped pinned memory transfers
  ○ Generic to Specialized matrix multiply
  ○ Streams

Concurrent Computation

● Overlap CPU and GPU computation
  ○ CPU handles variable length data sets one frame at a time
  ○ GPU handles fixed length data sets all frames concurrently

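// Must sit inside an enclosing "#pragma omp parallel" region;
// otherwise a single thread runs both sections one after the other.
#pragma omp sections
{
    #pragma omp section
    {
        // GPU Code: fixed length data, all frames concurrently
    }
    #pragma omp section
    {
        // CPU Code: variable length data, one frame at a time
    }
}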

Results

● 1 Process (5 threads): 8 seconds
● 6 Processes (2 threads): 22 seconds

Q & A
