
Using CNNs to Estimate Depth from Stereo Imagery
Source: cs229.stanford.edu/proj2015/178_poster.pdf


Tyler Jordan {tsjordan}, Skanda Shridhar {sshridha}, Jayant Thatte {[email protected]}

Department of Electrical Engineering, Stanford University

Motivation

• Depth estimation is an important problem in 3DTV, virtual reality, and autonomous vehicles.

• Conventional image processing algorithms don’t produce satisfactory results for complex, real-world scenes.

• We explore the use of CNNs to tackle this problem.

[Figure: processing pipeline. Labels: CNN, Matching Cost, Minimize, Disparity Map, Cross-Based Cost Aggregation, Energy Constraints, Occlusion Interpolation. Input caption: 9x9 patches from left and right images.]

Experimental Results

Major objects in the scenes, such as the road, signs, and cars, are accurate in the disparity maps. The right and left edges are not as clean as the center of the image due to the lack of redundant data. The CNN approach performs far better than the naïve plane-sweep approach.

Postprocessing [1, 4]

Cross-Based Cost Aggregation

Support region (red) created by the union of horizontal crosses along the vertical cross. The cross lengths are determined by intensity-difference and length constraints. This allows for context-based blurring.
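The support-region construction described above can be sketched in NumPy as follows. This is a minimal illustration, not the poster's implementation: it assumes a grayscale image, compares each candidate pixel with the arm's anchor pixel, and uses hypothetical values for the intensity threshold `tau` and the maximum arm length `max_len`. The function names are also made up for this sketch.

```python
import numpy as np

def arm_length(img, y, x, dy, dx, tau=20.0, max_len=17):
    """Number of steps an arm can grow from (y, x) in direction (dy, dx).

    The arm stops at the image border, at max_len (length constraint), or
    when the intensity differs from the anchor by tau or more.
    """
    h, w = img.shape
    n = 0
    while n < max_len:
        yy, xx = y + (n + 1) * dy, x + (n + 1) * dx
        if not (0 <= yy < h and 0 <= xx < w):
            break
        if abs(float(img[yy, xx]) - float(img[y, x])) >= tau:
            break
        n += 1
    return n

def support_region(img, y, x, tau=20.0, max_len=17):
    """Boolean mask: union of the horizontal arms of every pixel on the
    vertical arm of (y, x)."""
    mask = np.zeros(img.shape, dtype=bool)
    up = arm_length(img, y, x, -1, 0, tau, max_len)
    down = arm_length(img, y, x, 1, 0, tau, max_len)
    for yy in range(y - up, y + down + 1):
        left = arm_length(img, yy, x, 0, -1, tau, max_len)
        right = arm_length(img, yy, x, 0, 1, tau, max_len)
        mask[yy, x - left:x + right + 1] = True
    return mask

def aggregate_cost(cost, img, tau=20.0, max_len=17):
    """Average one slice of the matching cost over each pixel's support region."""
    out = np.empty_like(cost, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = cost[support_region(img, y, x, tau, max_len)].mean()
    return out
```

Because the arms stop at intensity edges, the averaging stays within regions of similar intensity, which is one way to read "context-based blurring". The per-pixel loops are written for clarity; reference [1] uses integral images to make the aggregation efficient.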

[Figure labels: L, R, Cost, CBCA]

Occlusion Interpolation

Regions occluded in the left image (blue) are filled in with data from the right (red).

Holes in the depth map are filled by interpolation using bidirectional matching.

Regions where the right and left depth maps don’t agree after occlusion interpolation are filled with the median of the closest good pixels in 16 directions.
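A rough sketch of the last step above, assuming rectified images where a left pixel at column x with disparity d corresponds to the right pixel at column x - d. The agreement tolerance `tol`, the particular set of 16 directions, and both function names are assumptions made for illustration; the poster does not specify them.

```python
import numpy as np

# Hypothetical choice of 16 scan directions: 8 compass steps plus 8 knight-like steps.
DIRECTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0),
              (1, 1), (1, -1), (-1, 1), (-1, -1),
              (1, 2), (2, 1), (-1, 2), (-2, 1),
              (1, -2), (2, -1), (-1, -2), (-2, -1)]

def consistent_mask(disp_left, disp_right, tol=1.0):
    """True where the left disparity agrees with the right disparity at the
    matching column x - d (pixels that map outside the image count as bad)."""
    h, w = disp_left.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    match_cols = cols - np.rint(disp_left).astype(int)
    inside = (match_cols >= 0) & (match_cols < w)
    right_vals = disp_right[rows, np.clip(match_cols, 0, w - 1)]
    return inside & (np.abs(disp_left - right_vals) <= tol)

def fill_with_directional_median(disp, good):
    """Replace each bad pixel with the median of the nearest good pixel
    found along each of the 16 directions."""
    out = disp.astype(float).copy()
    h, w = disp.shape
    for y, x in zip(*np.nonzero(~good)):
        found = []
        for dy, dx in DIRECTIONS:
            yy, xx = y + dy, x + dx
            while 0 <= yy < h and 0 <= xx < w:
                if good[yy, xx]:
                    found.append(disp[yy, xx])
                    break
                yy, xx = yy + dy, xx + dx
        if found:  # leave the pixel unchanged if no good pixel is reachable
            out[y, x] = np.median(found)
    return out
```

Using the median rather than the mean keeps the filled value from being pulled toward a single outlier among the neighbouring disparities.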

References

[1] Zhang et al. "Cross-based local stereo matching using orthogonal integral images." Circuits and Systems for Video Technology, IEEE Transactions on 19.7 (2009).
[2] Kim et al. "3D scene reconstruction from multiple spherical stereo pairs." International Journal of Computer Vision 104.1 (2013).
[3] Kim et al. "Dynamic 3d scene reconstruction in outdoor environments." In Proc. IEEE Symp. on 3D Data Processing and Visualization. 2010.
[4] Žbontar et al. "Computing the stereo matching cost with a convolutional neural network." arXiv preprint arXiv:1409.4326 (2014).

Convolutional Neural Network

[Figure: L1 filters]

• Inputs: 1 million matching and non-matching image patches are fed into the network

• Output: Stereo matching cost, per pixel

• Training: The 1st layer is convolutional; all remaining layers are fully connected

• Testing: All fully connected layers are expressed as conv. layers so that the entire test image can be processed at once

• Platform: Caffe w/ cuDNN GPU acceleration
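The test-time trick in the Convolutional Neural Network panel (expressing fully connected layers as convolutional layers) can be illustrated with a small NumPy sketch. It is not the poster's Caffe network: it assumes a single fully connected layer acting directly on a one-channel k x k patch, and the function names and shapes are hypothetical. The point is only that the same weights, reshaped into k x k filters and slid over the image, reproduce the per-patch output at every position in one pass.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def fc_on_patch(patch, W, b):
    """Fully connected layer applied to one flattened k x k patch."""
    return W @ patch.ravel() + b

def fc_as_conv(image, W, b, k):
    """Apply the same weights as k x k filters at every valid image position.

    W has shape (n_out, k * k) and b has shape (n_out,). The result has shape
    (n_out, H - k + 1, W - k + 1).
    """
    n_out = W.shape[0]
    filters = W.reshape(n_out, k, k)               # row-major, matches ravel()
    windows = sliding_window_view(image, (k, k))   # (H-k+1, W-k+1, k, k)
    return np.einsum("oij,yxij->oyx", filters, windows) + b[:, None, None]
```

For a 9x9 patch size, `fc_as_conv(image, W, b, 9)[:, y, x]` matches `fc_on_patch(image[y:y+9, x:x+9], W, b)`, so the network trained on patches does not have to be re-run patch by patch on the test image.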