Lab Project GPU Programming SS12 - TUM · GPU Programming Lab Project Optical Flow and Super...

GPU Programming Lab Project

Optical Flow and Super Resolution

Ross Kidson 03627521
Oliver Dunkley 03631802

Introduction

Contents
○ Implementation environment, methods
○ Optical Flow
  ■ example
  ■ theory
  ■ implementation
  ■ results & performance
○ Super Resolution
  ■ theory
  ■ implementation
  ■ results

Implementation

● Nsight, Ubuntu
● two code bases
● focused on benchmarking/comparing memory types
● Python scripts & OpenOffice

Benchmarking methods

● GPU - NVIDIA Visual Profiler
● CPU - custom 'benchmark' class to reproduce the same measurements

Benchmark::instance()->addEvent(Benchmark::start, "add_flow_fields");
for (unsigned int p = 0; p < nx_fine * ny_fine; p++) {
    _u1[p] += _u1lvl[p];
    _u2[p] += _u2lvl[p];
}
Benchmark::instance()->addEvent(Benchmark::end, "add_flow_fields");

Benchmark::instance()->doBenchmark(); // Save collected data to file
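The Benchmark class itself is not shown on the slide. A minimal sketch of how such a singleton could pair start/end events and accumulate wall-clock time, assuming std::chrono timing. The totalMs and count accessors are hypothetical additions for illustration, and the file output of doBenchmark is omitted:

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical stand-in for the project's Benchmark class (not the original code).
class Benchmark {
    using Clock = std::chrono::steady_clock;

public:
    enum EventType { start, end };

    // Single shared instance, created on first use.
    static Benchmark* instance() {
        static Benchmark self;
        return &self;
    }

    // Record a start or end event for the named code section.
    void addEvent(EventType type, const std::string& name) {
        const Clock::time_point now = Clock::now();
        if (type == start) {
            open_[name] = now;
            return;
        }
        auto it = open_.find(name);
        assert(it != open_.end() && "end event without matching start");
        totalMs_[name] +=
            std::chrono::duration<double, std::milli>(now - it->second).count();
        ++count_[name];
        open_.erase(it);
    }

    // Accumulated time of all completed start/end pairs, in milliseconds.
    double totalMs(const std::string& name) const {
        auto it = totalMs_.find(name);
        return it == totalMs_.end() ? 0.0 : it->second;
    }

    // Number of completed start/end pairs for the section.
    int count(const std::string& name) const {
        auto it = count_.find(name);
        return it == count_.end() ? 0 : it->second;
    }

private:
    Benchmark() = default;
    std::map<std::string, Clock::time_point> open_;
    std::map<std::string, double> totalMs_;
    std::map<std::string, int> count_;
};
```

Keying events by name lets the same label be used around the CPU loop and in the profiler output, which makes the two data sets easier to merge afterwards.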

Optical flow

"Pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene"

Usages
● motion detection
● object segmentation
● time-to-collision
● motion compensated encoding
● stereo disparity measurement

Optical flow: Example input street 1

Optical flow: Example input street 2

Optical flow: Example output

Optical flow

Optical flow: Example input Tron 1

Optical flow: Example input Tron 2

Optical flow: Overlaid output

Optical flow: Formulation

Given two images I1 and I2, one computes a vector field u (the flow field) that matches image intensities:

Formulate as an energy function to minimize
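The two equations on this slide did not survive extraction. As a hedged reconstruction in the slide's notation, assuming the convention of the later warping step (I2 is warped by u toward I1) and a generic quadratic smoothness term whose weight \lambda is not named in the source, the matching condition and a corresponding energy could read:

```latex
\begin{align}
  I_1(x) &= I_2\bigl(x + u(x)\bigr), \qquad u = (u_1, u_2) \\
  E(u)   &= \int_\Omega \bigl(I_2(x + u(x)) - I_1(x)\bigr)^2 \, dx
          + \lambda \int_\Omega \bigl(|\nabla u_1|^2 + |\nabla u_2|^2\bigr) \, dx
\end{align}
```

The first integral penalizes intensity mismatch, the second penalizes a non-smooth flow field.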

Optical flow: Formulation

Apply a Taylor series expansion and a robust penalty term:
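The equation for this step is missing from the extracted text. A sketch of what the step typically produces, assuming the flow increment du(x) and the image gradients Ix, Iy, It that appear in the implementation slides, an assumed weight \lambda, and a regularized penalty \Psi with a small \varepsilon in the style of Papenberg et al. (the exact terms on the slide are not recoverable):

```latex
\begin{align}
  I_2(x + u + du) &\approx I_2(x + u) + I_x\,du_1 + I_y\,du_2,
  \qquad I_t = I_2(x + u) - I_1(x) \\
  E(du) &= \int_\Omega \Psi\bigl((I_t + I_x\,du_1 + I_y\,du_2)^2\bigr)\,dx
         + \lambda \int_\Omega \Psi\bigl(|\nabla(u_1 + du_1)|^2 + |\nabla(u_2 + du_2)|^2\bigr)\,dx \\
  \Psi(s^2) &= \sqrt{s^2 + \varepsilon^2}
\end{align}
```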

This can be solved with the Euler-Lagrange equations and reformulated into the linear form Ax = b, which can then be solved using successive over-relaxation (SOR)
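To make the SOR step concrete, here is a minimal CPU sketch for a small dense system. The dense matrix, the function name sorSolve and the fixed iteration count are illustration-only assumptions; the flow solver applies the same update to a sparse stencil on the image grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Successive over-relaxation for A x = b, starting from x = 0.
// omega = 1 gives Gauss-Seidel; 1 < omega < 2 over-relaxes each update.
std::vector<double> sorSolve(const std::vector<std::vector<double>>& A,
                             const std::vector<double>& b,
                             double omega, int iterations) {
    const std::size_t n = b.size();
    assert(A.size() == n);
    std::vector<double> x(n, 0.0);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < n; ++i) {
            // Sum over all off-diagonal terms, using already updated entries of x.
            double sigma = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                if (j != i) sigma += A[i][j] * x[j];
            }
            const double gaussSeidel = (b[i] - sigma) / A[i][i];
            // Blend the old value with the Gauss-Seidel update.
            x[i] = (1.0 - omega) * x[i] + omega * gaussSeidel;
        }
    }
    return x;
}
```

For example, sorSolve({{4, 1}, {1, 3}}, {1, 2}, 1.2, 100) converges to x = (1/11, 7/11).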

1. Resize initial u(x) to current level: resampleAreaParallel(u(x)^(k-1))

2. Warp the original image at this level by resized flow field u(x): backwardRegistrationBilinearFunctionTex(I2, u(x)) --> I2_warp

3. Set du to 0: setKernel(du(x), 0)

4. Update robustification terms based on current u(x) and du(x): sorflow_update_robustifications_warp_tex(u(x), du(x))

5. Update right hand side of equation to solve - b term: sorflow_update_righthandside_tex(u(x), b)

6. Solve for du(x): sorflow_nonlinear_warp_sor_tex(b, ) --> du(x)

7. Update robustifications and b and re-solve for du(x)

8. Add du to u: addKernel(u(x), du(x))

9. Continue to next pyramid level

Loops
● inner iteration (SOR) loop: step 6
● outer iteration (update robustifications) loop: steps 4 to 6
● next pyramid level loop (coarse --> fine): steps 1 to 8

Output: u(x)
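The loop nesting of the steps above can be sketched as a host-side schedule. This is not the project's code: the kernels are replaced by recording their names, and the nesting (SOR inside the robustification loop inside the pyramid loop) is an assumption read off the loop labels on the slide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the order in which the kernels would be launched (names only).
std::vector<std::string> flowSchedule(int levels, int outerIterations,
                                      int innerIterations) {
    assert(levels > 0 && outerIterations > 0 && innerIterations > 0);
    std::vector<std::string> calls;
    for (int level = 0; level < levels; ++level) {  // coarse --> fine
        calls.push_back("resampleAreaParallel");                      // step 1
        calls.push_back("backwardRegistrationBilinearFunctionTex");   // step 2
        calls.push_back("setKernel");                                 // step 3
        for (int outer = 0; outer < outerIterations; ++outer) {
            calls.push_back("sorflow_update_robustifications_warp_tex");  // step 4
            calls.push_back("sorflow_update_righthandside_tex");          // step 5
            for (int inner = 0; inner < innerIterations; ++inner) {
                calls.push_back("sorflow_nonlinear_warp_sor_tex");        // step 6
            }
        }
        calls.push_back("addKernel");                                 // step 8
    }
    return calls;
}
```

Under this reading the SOR kernel is launched levels x outerIterations x innerIterations times, which suggests why it is the main target for memory optimizations.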

Optical Flow Results: Outer Iterations

[Chart; surviving data labels: x163, x162, x137, x104]

Applying GPU Techniques

● registers

● constant memory

● shared memory

● texture memory

● global memory

● kernels, blocks, pitch, warp

Variables tested

Already in Texture Memory
● Image gradients Ix, Iy, It
● backwardRegistration - I2

Added to Texture Memory
●
● u(x)

Added to Shared Memory
● u(x)
● du(x)

resampleAreaParallel(u(x)^(k-1))

backwardRegistrationBilinearFunctionTex(I2, u(x))

setKernel(du(x),0)

sorflow_update_robustifications_warp_tex(u(x), du(x))

sorflow_update_righthandside_tex(u(x), )

sorflow_nonlinear_warp_sor_tex(b, )

addKernel(u(x),du(x))

inner iteration (SOR) loop

outer iteration (update robustifications) loop

next pyramid level loop

Output: u(x)

Optical Flow Results

GPU Techniques - Small Hack 1

if (x == 0) shared_mem[0][ty] = shared_mem[tx][ty]; // one at a time

vs

if (x == 0) {
    shared_mem1[0][ty] = shared_mem1[tx][ty]; // 1 to N in parallel
    shared_memN[0][ty] = shared_memN[tx][ty];
}

Optical Flow Results

Super Resolution: Formulation

Given a number of degraded observations of image I that have been transformed by linear operations, a set of linear equations can be obtained:
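The operator symbols and the equation were lost in extraction. A common way to write such an observation model, assuming a warp W_n, a blur B, a downsampling D and a noise term e_n (all symbol names are assumptions, chosen to match the warp, Gaussian blur and resample steps of the implementation):

```latex
\begin{equation}
  I_n = D\,B\,W_n\,I + e_n, \qquad n = 1, \dots, N
\end{equation}
```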

Super Resolution: Formulation

Energy function with regularity penalty term:
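The energy itself is missing from the extracted slide. Judging only from the kernel names dualL1Difference and dualTVHuber used in the implementation, a plausible form combines an L1 data term with a Huber-regularized total variation penalty, as in Unger et al. The weight \lambda, the Huber parameter \varepsilon and the operators D, B, W_n are assumed notation:

```latex
\begin{align}
  E(I) &= \int_\Omega |\nabla I|_\varepsilon \, dx
        + \lambda \sum_{n=1}^{N} \int_{\Omega_n} \bigl| (D\,B\,W_n\,I)(x) - I_n(x) \bigr| \, dx \\
  |z|_\varepsilon &=
  \begin{cases}
    \dfrac{|z|^2}{2\varepsilon} & \text{if } |z| \le \varepsilon \\[1ex]
    |z| - \dfrac{\varepsilon}{2} & \text{otherwise}
  \end{cases}
\end{align}
```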

Super Resolution: General Method

Super Res image I

Lower resolution images In: I1 I2 I3 I4 I6 I7 I8 I9

1. Warp initial guess image I to I1 (BackwardRegistration) --> Iwarped

2. Shrink warped image and perform Gaussian blur

3. Subtract images

4. Resample difference image back to larger resolution --> Idiff resampled

5. Warp back to middle image (ForwardRegistration) --> Idiff resampled warped

6. Repeat for all other images and add all difference images together (I2 diff resampled warped, ..., In diff resampled warped)

7. Repeat the entire process until image converges

Image loop 1:Calculate diff images

For each image In:

backwardRegistrationBilinearValueTex(uor(x), u(x))

gaussBlurSeparateMirrorGpu() [const memory!]

resampleAreaParallelSeparate() (shrink)

dualL1Difference()

Results in a vector of original-sized difference images
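The first step of image loop 1, backward registration, can be sketched on the CPU as follows. On the GPU the bilinear interpolation is done by the texture unit; here it is written out by hand. The function names and the clamp-to-border behaviour are assumptions, not taken from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinear sample of a row-major nx x ny image at (x, y), clamped to the border.
float sampleBilinear(const std::vector<float>& img, int nx, int ny, float x, float y) {
    x = std::min(std::max(x, 0.0f), static_cast<float>(nx - 1));
    y = std::min(std::max(y, 0.0f), static_cast<float>(ny - 1));
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int x1 = std::min(x0 + 1, nx - 1);
    const int y1 = std::min(y0 + 1, ny - 1);
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);
    const float top = (1.0f - fx) * img[y0 * nx + x0] + fx * img[y0 * nx + x1];
    const float bottom = (1.0f - fx) * img[y1 * nx + x0] + fx * img[y1 * nx + x1];
    return (1.0f - fy) * top + fy * bottom;
}

// Backward registration: each output pixel gathers from img at x + u(x).
std::vector<float> backwardWarp(const std::vector<float>& img,
                                const std::vector<float>& u1,
                                const std::vector<float>& u2,
                                int nx, int ny) {
    const std::size_t n = static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny);
    assert(img.size() == n && u1.size() == n && u2.size() == n);
    std::vector<float> out(n);
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            const int p = y * nx + x;
            out[p] = sampleBilinear(img, nx, ny, x + u1[p], y + u2[p]);
        }
    }
    return out;
}
```

Because every output pixel only reads, this gather form needs no atomics, in contrast to the forward registration used in image loop 2.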

Image loop 2: Warp difference images forward

For each difference image

resampleAreaParallelSeparateAdjoined()

gaussBlurSeparateMirrorGpu()

forewardRegistrationBilinearAtomic()

addKernel(diff_image, temp_accumulator)

Results in an image of all summed differences together
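Image loop 2 uses an "Adjoined" resampling kernel. Reading that as the adjoint (transpose) of the area downsampling is an assumption. Under that reading, a 1D sketch with an integer factor shows the pair of operators and the identity <D x, y> = <x, D^T y> that can be used to test them (all function names are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Area downsampling: each output sample is the mean of `factor` inputs.
std::vector<double> downsampleArea(const std::vector<double>& x, std::size_t factor) {
    assert(factor > 0 && x.size() % factor == 0);
    std::vector<double> y(x.size() / factor, 0.0);
    for (std::size_t j = 0; j < x.size(); ++j) {
        y[j / factor] += x[j] / static_cast<double>(factor);
    }
    return y;
}

// Adjoint of downsampleArea: spread each coarse sample over `factor` fine samples.
std::vector<double> downsampleAreaAdjoint(const std::vector<double>& y, std::size_t factor) {
    assert(factor > 0);
    std::vector<double> x(y.size() * factor);
    for (std::size_t j = 0; j < x.size(); ++j) {
        x[j] = y[j / factor] / static_cast<double>(factor);
    }
    return x;
}

// Plain dot product, used for the adjoint test.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}
```

Checking dot(downsampleArea(x, f), y) against dot(x, downsampleAreaAdjoint(y, f)) for random x and y is a cheap way to catch a mismatched operator pair.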

Repeat for the chosen number of outer iterations

Putting it together (for one image)

Image loop 1: Calculate difference images

Image loop 2: Morph and sum difference images

dualTVHuber(uor) --> xi1, xi2

primal1N(xi1, xi2, difference_sum, u, uor) --> uor(x), u(x)

setKernel(difference_sum, 0)

Final Result

Super Resolution Output & CUDA tex2D(...) differences

[Image panels: difference, manual, CUDA tex2D(...), original]

Super Resolution GPU Performance

22x Speed-up

Super Resolution Performance

GPU Techniques - Small Hacks 2

const int tx_1 = tx == 0 ? tx : tx - 1;

vs

const int tx_1 = tx - 1 * (x > 0);
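The two forms can be checked against each other on the CPU. The sketch below uses tx in both conditions so that they are directly comparable; the slide's second form tests x instead, which agrees only where x > 0 holds exactly when tx > 0. The function names are hypothetical.

```cpp
#include <cassert>

// Branching form: clamp the left-neighbour index at 0.
inline int leftNeighbourBranch(int tx) {
    return tx == 0 ? tx : tx - 1;
}

// Arithmetic form: the comparison yields 0 or 1, so 1 is subtracted only when tx > 0.
inline int leftNeighbourArithmetic(int tx) {
    return tx - 1 * (tx > 0);
}
```

Compilers often turn such a ternary into a select instruction on their own, so whether the arithmetic form actually helps is best confirmed in the profiler.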

Debugging techniques: Compare Image class

● Automatically compare CPU to GPU images
  ○ Compares pixel values
  ○ Errors not always visible
● Outputs which failed, stats, display image differences
● Facilitates code testing after optimizations/hacks
● Problem with texture memory (low similarity thresholds)

ImageComparison* ic = ImageComparison::instance();

CPU:
ic->addImage(_I1pyramid->level[rec_depth], "CI1", rec_depth, nx_fine, ny_fine, 1, "CPU");
ic->addImage(_I2pyramid->level[rec_depth], "CI2", rec_depth, nx_fine, ny_fine, 1, "CPU");
ic->dumpData("cpu_images"); // save data to disk
....

GPU:
ic->addImage(_I1pyramid->level[rec_depth], "CI1", rec_depth, nx_fine, ny_fine, pitch, "GPU");
ic->addImage(_I2pyramid->level[rec_depth], "CI2", rec_depth, nx_fine, ny_fine, pitch, "GPU");
....
ImageComparison::instance()->compareImages("cpu_images");
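The comparison behind such a class can be sketched as a tolerance check that respects each buffer's pitch. This is an illustration only: the struct, the function name and the choice of expressing the pitch in elements are assumptions, and the GPU buffer is assumed to have been copied back to the host already.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result record: number of differing pixels and largest deviation.
struct ComparisonResult {
    std::size_t mismatches;
    float maxAbsDiff;
};

// Compare two nx x ny images whose rows are cpuPitch / gpuPitch elements apart.
ComparisonResult comparePitched(const std::vector<float>& cpu, std::size_t cpuPitch,
                                const std::vector<float>& gpu, std::size_t gpuPitch,
                                std::size_t nx, std::size_t ny, float tolerance) {
    assert(cpuPitch >= nx && gpuPitch >= nx);
    assert(cpu.size() >= cpuPitch * ny && gpu.size() >= gpuPitch * ny);
    ComparisonResult result{0, 0.0f};
    for (std::size_t y = 0; y < ny; ++y) {
        for (std::size_t x = 0; x < nx; ++x) {
            // Padding elements beyond nx in each row are never read.
            const float diff = std::fabs(cpu[y * cpuPitch + x] - gpu[y * gpuPitch + x]);
            if (diff > tolerance) ++result.mismatches;
            if (diff > result.maxAbsDiff) result.maxAbsDiff = diff;
        }
    }
    return result;
}
```

cudaMallocPitch reports the pitch in bytes, so it would have to be divided by sizeof(float) before being passed to a function like this. A tolerance rather than exact equality is needed because texture interpolation on the GPU does not reproduce CPU floating-point results bit for bit.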

Difficulties

● Diving into the code
● Compiling on a local system
● Debugging segfaults (printf, __LINE__, __FILE__ helped...)
● Texture offsets
● Incorrect in/out pitches
● Forgetting to catchkernel (oops)
● Evaluating massive amounts of data
● Maintaining two code bases
● Combining benchmark data

References:

Nils Papenberg, Andres Bruhn, Thomas Brox, Stephan Didas, and Joachim Weickert. 2006. Highly Accurate Optic Flow Computation with Theoretically Justified Warping. Int. J. Comput. Vision 67, 2 (April 2006), 141-158.

Markus Unger, Thomas Pock, Manuel Werlberger, and Horst Bischof. 2010. A convex approach for variational super-resolution. In Proceedings of the 32nd DAGM conference on Pattern recognition, Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler (Eds.). Springer-Verlag, Berlin, Heidelberg, 313-322.

NVIDIA CUDA C Best Practices Guide DG-05603-001_v4.1 | January 2012
