Accelerating 3D Facial Modeling using ArrayFire, OpenCV and CUDA
Umar Arshad (@arshad_umar)
ArrayFire (@arrayfire)
ArrayFire
● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner
● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development
● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs
Problem
● A customer came to us with a slow application
  ○ Made use of OpenCV and OpenMP
  ○ 8 threads: 30+ seconds
  ○ One process
  ○ Developed on OS X
● Required a significant hardware investment
  ○ Increased maintenance
  ○ Not financially viable in production
  ○ Had Windows infrastructure
Improvements
● OpenCV - ArrayFire interop
● Rendering using GPUs
  ○ Partial CUDA-based estimation
  ○ OpenGL-based rendering
● Batching Operations
  ○ Combining data into a single operation
● Concurrent Processing
  ○ CPU: small variable-length data
  ○ GPU: large fixed-length data
Moving to ArrayFire
● OpenCV Mat to ArrayFire array
  ○ Row vs. Column Major
  ○ http://blog.accelereyes.com/blog/2012/09/19/image-processing-with-arrayfire-and-opencv/
● Similar Interface
  ○ Allowed for quick porting
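The row-major vs. column-major point can be illustrated with a minimal sketch. It uses plain std::vector buffers rather than actual OpenCV or ArrayFire calls, and the helper name rowMajorToColMajor is hypothetical. It assumes a single-channel float image stored row by row (the cv::Mat convention) and produces the column-by-column layout that ArrayFire arrays use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV or ArrayFire function): reorders a
// rows x cols single-channel image from row-major layout into
// column-major layout.
std::vector<float> rowMajorToColMajor(const std::vector<float>& src,
                                      std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Row-major index is r * cols + c; column-major is c * rows + r.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}
```

A common alternative is to upload the row-major buffer with the two dimensions swapped and transpose it on the device, which avoids the host-side copy; which variant is cheaper likely depends on image size.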
Rendering
● Software rasterization
● Analysis of algorithm
  ○ Did not require an exact render
● ArrayFire-based estimate
  ○ Plot points
  ○ Dilate
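The plot-then-dilate estimate can be sketched on the CPU as follows. This is only an illustration of the idea, not the ArrayFire implementation from the talk: the names Mask, plotPoints and dilate3x3 are hypothetical, and the 3x3 structuring element is an assumption, since the slides do not state the kernel size.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Mask = std::vector<unsigned char>;  // row-major, width * height entries

// Marks each projected point (x, y) in an otherwise empty mask.
Mask plotPoints(const std::vector<std::pair<int, int>>& pts,
                int width, int height) {
    Mask m(static_cast<std::size_t>(width) * height, 0);
    for (const auto& p : pts) {
        const int x = p.first, y = p.second;
        if (x >= 0 && x < width && y >= 0 && y < height)
            m[static_cast<std::size_t>(y) * width + x] = 1;
    }
    return m;
}

// Binary dilation with an assumed 3x3 structuring element: a pixel is set
// if any pixel in its 3x3 neighbourhood is set. This fills the gaps
// between plotted points to approximate a filled render.
Mask dilate3x3(const Mask& in, int width, int height) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    Mask out(in.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char v = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < width && ny >= 0 && ny < height &&
                        in[static_cast<std::size_t>(ny) * width + nx])
                        v = 1;
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = v;
        }
    }
    return out;
}
```

Both steps are data-parallel per pixel or per point, which is presumably why they map well to a GPU array library when an exact rasterization is not needed.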
Rendering
● Moved to OpenGL for some cases
  ○ Makes use of the hardware rasterizer
  ○ ArrayFire -> OpenGL interop using CUDA-OpenGL interop
  ○ See the ArrayFire GitHub for a sample implementation
Batching
● Used OpenMP for parallelism
  ○ One frame per thread
  ○ Optimized for CPU
● One CPU thread + GPU
  ○ Parallelism on GPU vs. Parallelism on CPU
● Combined OpenMP threads
Batching
● Many small operations
  ○ Individually it didn’t make sense to port to the GPU
● Increase dimensionality of the data
  ○ 2D -> 3D
  ○ GFOR and Strided Access
● Moved to single-threaded code
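The 2D -> 3D idea can be sketched as follows: instead of processing each small per-frame matrix separately, the frames are stacked into one contiguous buffer and a single operation walks all of them with a fixed stride. This is a CPU illustration only; the Batch struct, stackFrames and subtractFrameMeans are hypothetical names, and mean subtraction is just a stand-in for whatever per-frame operation is being batched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: `frames` matrices of rows x cols elements stored
// back to back, so frame f starts at offset f * rows * cols.
struct Batch {
    std::size_t rows, cols, frames;
    std::vector<float> data;
};

// Stacks equally sized 2D frames into one 3D buffer.
Batch stackFrames(const std::vector<std::vector<float>>& frames,
                  std::size_t rows, std::size_t cols) {
    Batch b{rows, cols, frames.size(), {}};
    b.data.reserve(rows * cols * frames.size());
    for (const auto& f : frames) {
        assert(f.size() == rows * cols);
        b.data.insert(b.data.end(), f.begin(), f.end());
    }
    return b;
}

// One call handles every frame; the frame index only changes the stride
// offset. On a GPU this would be one launch instead of one per frame.
void subtractFrameMeans(Batch& b) {
    const std::size_t stride = b.rows * b.cols;
    if (stride == 0) return;  // nothing to do for empty frames
    for (std::size_t f = 0; f < b.frames; ++f) {
        float* frame = b.data.data() + f * stride;
        float sum = 0.0f;
        for (std::size_t i = 0; i < stride; ++i) sum += frame[i];
        const float mean = sum / static_cast<float>(stride);
        for (std::size_t i = 0; i < stride; ++i) frame[i] -= mean;
    }
}
```

The point of the layout is that the work per call grows with the number of frames, so fixed launch and transfer overheads are paid once rather than once per small matrix.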
Batching
● Call custom CUDA kernels
  ○ Special indexing
● Specialized Matrix Multiply
  ○ ssyrk vs. gemm
  ○ 2x faster
  ○ Concurrent execution using streams
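// Get the raw device pointer behind the ArrayFire array.
float *bound = boundary.device<float>();
// CUDA launch configuration order is <<<grid (blocks), block (threads)>>>.
kernel<<<blocks, threads>>>(bound, boundary.elements());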
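Why a symmetric rank-k update can beat a general matrix multiply is easiest to see by counting work. The sketch below uses plain loops rather than BLAS calls; gemmAAt and syrkLower are hypothetical names, the matrices are row-major, and the scaling factors of the real ssyrk routine are fixed at alpha = 1 and beta = 0 for simplicity. Both compute C = A * A^T, but the syrk-style version fills only the lower triangle because the result is symmetric.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// General multiply: computes all n x n entries of C = A * A^T.
// A is n x k, row-major. Returns the number of multiply-adds performed.
std::size_t gemmAAt(const std::vector<float>& A, std::size_t n, std::size_t k,
                    std::vector<float>& C) {
    assert(A.size() == n * k && C.size() == n * n);
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p, ++ops)
                acc += A[i * k + p] * A[j * k + p];
            C[i * n + j] = acc;
        }
    return ops;
}

// syrk-style update: computes only entries with j <= i and leaves the
// upper triangle of C untouched, as BLAS syrk does.
std::size_t syrkLower(const std::vector<float>& A, std::size_t n, std::size_t k,
                      std::vector<float>& C) {
    assert(A.size() == n * k && C.size() == n * n);
    std::size_t ops = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j <= i; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p, ++ops)
                acc += A[i * k + p] * A[j * k + p];
            C[i * n + j] = acc;
        }
    return ops;
}
```

The general version performs n * n * k multiply-adds and the triangular one n * (n + 1) / 2 * k, which approaches half as n grows. That is consistent with the roughly 2x figure on the slide, although the slide does not say the arithmetic saving is the only cause.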
Batching
● Results
  ○ 90ms -> 28ms on a GTX 690
● Other Improvements
  ○ Overlapped pinned memory transfers
  ○ Generic to Specialized matrix multiply
  ○ Streams
Concurrent Computation
● Overlap CPU and GPU computation
  ○ CPU handles variable-length data sets one frame at a time
  ○ GPU handles fixed-length data sets, all frames concurrently
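// "parallel sections" creates the thread team; a bare "sections" construct
// outside a parallel region would run both sections on one thread.
#pragma omp parallel sections
{
    #pragma omp section
    {
        // GPU Code
    }
    #pragma omp section
    {
        // CPU Code
    }
}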
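A fuller sketch of the overlap pattern is shown below. The two worker functions are hypothetical stand-ins: processFixedLength represents the batched GPU path (here it just sums a buffer on the host) and processVariableLength represents the per-frame CPU path. The sketch uses `#pragma omp parallel sections` so that each section can run on its own thread; without OpenMP enabled at compile time the pragmas are ignored and the two sections simply run one after the other with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for the GPU path: one call over the whole fixed-length batch.
float processFixedLength(const std::vector<float>& batch) {
    return std::accumulate(batch.begin(), batch.end(), 0.0f);
}

// Stand-in for the CPU path: variable-length data, one frame at a time.
std::size_t processVariableLength(const std::vector<std::vector<int>>& frames) {
    std::size_t total = 0;
    for (const auto& f : frames) total += f.size();
    return total;
}

struct Results {
    float gpuSum;
    std::size_t cpuCount;
};

Results runConcurrently(const std::vector<float>& batch,
                        const std::vector<std::vector<int>>& frames) {
    Results r{0.0f, 0};
    // Each section writes a different member of r, so no locking is needed.
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            r.gpuSum = processFixedLength(batch);      // "GPU Code"
        }
        #pragma omp section
        {
            r.cpuCount = processVariableLength(frames); // "CPU Code"
        }
    }
    return r;
}
```

With a real GPU library in the first section, the host thread there mostly waits on the device, so the second section can likely use the otherwise idle CPU cores.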
Results
● 1 Process (5 threads): 8 seconds
● 6 Processes (2 threads): 22 seconds
Q & A