[IEEE 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG) - Jodhpur, India (2013.12.18-2013.12.21)] 2013 Fourth National

A Learning Based Approach for Dense Stereo Matching with IGMRF Prior

Sonam Nahar and Manjunath V. Joshi

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India.

Email: [email protected], [email protected]

Abstract: In this paper, we propose a learning based approach for solving the dense stereo matching problem using an edge preserving regularization prior. Given a test stereo pair and a training database consisting of disparity maps estimated from multiple view stereo images and their corresponding ground truths, we obtain the disparity map for the test set. We first obtain an initial disparity estimate by learning the disparities from the available database. A new learning based approach is proposed for obtaining the initial estimate that uses the estimated and the true disparities. Since disparity estimation is an ill-posed problem, we obtain the final disparity map using a regularization framework. The prior model for the disparity map is chosen as an Inhomogeneous Gaussian Markov Random Field (IGMRF). Assuming that the spatial variations among the disparity values captured in the initial estimate correspond to the variations in the true disparities, we obtain the IGMRF parameters at every pixel location using the initial estimate. A graph cuts based method is used to minimize the energy function. Experimental results on the standard dataset demonstrate the effectiveness of the proposed approach.

I. INTRODUCTION

Stereo vision has been an active research area in the field of computer vision for more than three decades. One common constraint in stereo is that the disparity should vary smoothly almost everywhere while preserving sharp discontinuities that may exist at object boundaries. Hence, the problem needs proper regularization to apply these constraints when it is solved in an energy minimization framework. Many of the current better performing techniques are based on Markov random field (MRF) formulations [1], as noted in [2], [3], and are solved using graph cuts and belief propagation [4], [5]. These techniques require users to set the global parameters, i.e., the regularization weight $\lambda$ and the edge preserving parameter, by trial and error when working on a set of images. This process is time consuming and the solution is highly sensitive to the values of these parameters. In order to overcome this problem, the authors in [6] estimate the MRF parameters iteratively from the previously estimated disparities using the expectation maximization (EM) algorithm. The model parameters are learned from the same stereo pair for which the disparities have to be estimated. Hence the parameter values may vary for different stereo pairs.

In recent times, there has been considerable progress in solving the stereo vision problem using supervised learning due to the increasing availability of ground truth data. The model parameters are learned from labeled training stereo images and the model is tested on unseen inputs. A recent paper that employs this supervised learning paradigm is [7], where the authors present a conditional random field (CRF) model for stereo vision and the maximum likelihood estimator for the model parameters is obtained via gradient descent. Computing the CRF parameters, however, involves the partition function, which is intractable in cyclic graphs. Hence the partition function is approximated by the model distribution, which is obtained using graph cuts. This method has high computational complexity, and the approximation can lead to poor disparity estimates. The number of CRF parameters used is also limited, which affects the solution. The most recent work in learning based stereo vision is [8], where the authors present a CRF based model for stereo vision with non-parametric cost functions, which can be learnt automatically using structured support vector machines (SVM) with linear kernels. These learning methods give performance comparable to the state of the art techniques but have not incorporated the discontinuity preservation constraint in their models, which is a challenging issue in stereo.

In general, MRF based priors, e.g., truncated linear and truncated quadratic [4], use a limited number of discontinuity preservation parameters that account for the spatial variation in labels only globally. We need a prior which considers the spatial variations among the disparities locally. This motivates us to use an inhomogeneous prior which can adapt to the local structure of the disparity map in order to provide better disparity estimates. Hence, motivated by the IGMRF model presented in [9] and the learning methods of [7], [8], we present a learning based method that uses multi-view stereo with an IGMRF prior. The IGMRF model is adaptive to the local structure of an image and hence eliminates the need for a separate edge preserving prior. It enables us to handle smooth as well as sharp changes in disparity. A new learning based approach with an IGMRF regularization framework is proposed in which an approximation to the true disparity is obtained using a training set of disparity maps. The learning of the initial disparity estimate uses the disparities derived from the multiple baseline stereo approach [10] and the ground truth disparity maps. The initial disparity map is used in estimating the IGMRF parameters. Unlike the method in [6], our approach does not need a complex probabilistic model or iterative alternating optimization to estimate the parameters. Our results show that the proposed approach performs well for the planar regions in the scene while preserving the depth discontinuities. Although machine learning approaches give a strong mathematical framework for providing the solution, our method for estimating the model parameters is very simple and computationally less taxing.

II. PROBLEM FORMULATION

We formulate the dense stereo matching problem in a global energy minimization framework. Our goal is to minimize a suitable energy function to get the desired solution. Typically for stereo, the following energy function is used:

$$E(d) = E_{\mathrm{data}}(d) + E_{\mathrm{prior}}(d). \tag{1}$$

Here $E_{\mathrm{data}}(d)$ measures how well the disparity field $d$ agrees with the stereo image pair $I_L$ (left image) and $I_R$ (right image) of a scene, and the prior energy term $E_{\mathrm{prior}}(d)$ measures how well it agrees with the prior knowledge about the solution. We choose our data term $E_{\mathrm{data}}(d)$ as follows:

$$E_{\mathrm{data}}(d) = \sum_{(x,y)} \big(I_L(x,y) - I_R(x + d(x,y),\, y)\big)^2. \tag{2}$$
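As a minimal sketch of how the data term in (2) could be evaluated for a given integer disparity map, consider the following. The function name `data_energy`, the `[y, x]` array layout, and the clamping of out-of-range matches to the image border are assumptions of this sketch; the paper does not specify how border pixels are handled.

```python
import numpy as np

def data_energy(left, right, disp):
    """Sum over pixels of (I_L(x, y) - I_R(x + d(x, y), y))^2, as in Eq. (2).

    left, right: 2-D grayscale images stored as img[y, x].
    disp: 2-D integer disparity map of the same shape.
    """
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Column of the matching pixel in the right image (sign as in Eq. (2)).
    xr = xs + disp
    # Assumption: clamp matches that fall outside the image to the border.
    xr = np.clip(xr, 0, w - 1)
    diff = left.astype(np.float64) - right[ys, xr].astype(np.float64)
    return float(np.sum(diff ** 2))

# Toy example: the right image is the left image shifted by one column.
left = np.tile(np.arange(5), (3, 1))
right = np.roll(left, 1, axis=1)
print(data_energy(left, right, np.ones((3, 5), dtype=int)))
print(data_energy(left, right, np.zeros((3, 5), dtype=int)))
```

In the toy example the correct disparity of 1 yields a much lower energy than a disparity of 0; the small residual comes only from the clamped border column.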

To preserve depth discontinuities, we use an edge preserving prior based on the Inhomogeneous Gaussian Markov Random Field (IGMRF) model. We model the disparity map using the IGMRF model with an energy function that allows us to adjust the amount of regularization locally.

III. IGMRF PRIOR MODEL

An MRF prior for the unknown disparity map can be described by using an energy function expressed as the Gibbsian density given by [1]

$$P(d) = \frac{1}{D_\theta}\, e^{-U(d)}, \tag{3}$$

where $d$ is the true disparity map to be estimated and $D_\theta$ is the partition function. One can choose $U$ as a quadratic form so as to represent a homogeneous MRF with a single global parameter, say $\lambda$, or a few parameters based on the neighborhood, assuming that the disparities are globally smooth [1]. This homogeneous prior tends to oversmooth the solution and blur the edges. A more suitable model would choose $U$ such that only the homogeneous regions are smooth and depth discontinuities are preserved in the form of edges. The authors in [4], [5] use discontinuity preserving priors which need optimal tuning of parameters, and the quality of the solution is highly sensitive to the values of these parameters. We need a prior which considers the spatial variation among disparities locally. This motivates us to use an inhomogeneous prior which can adapt to the local structure in order to provide better disparity or depth estimates. This helps us to eliminate the need for manual tuning of parameters. The IGMRF based prior model has been used in satellite image deblurring [9], multiresolution fusion of satellite images [11] and super-resolution of images [12]. We model the disparity map as an inhomogeneous Gaussian MRF with an energy function as defined in [9]. For our stereo framework it is given by:

$$E_{\mathrm{prior}}(d) = \sum_{x} \sum_{y} \Big[ b^X_{x,y}\,\big(d(x-1,y) - d(x,y)\big)^2 + b^Y_{x,y}\,\big(d(x,y-1) - d(x,y)\big)^2 \Big]. \tag{4}$$

Here $b^X$ and $b^Y$ are the spatially adaptive IGMRF parameters at location (x, y) for vertical and horizontal directions, respectively, and d(x, y) is the true disparity at the same location. In the above energy function, the spatial dependency at a pixel location is modelled by considering a finite difference approximation to first order differentiation in the horizontal and vertical directions. The IGMRF parameters are estimated using maximum likelihood (ML) estimation and are given by [9]

$$\hat{b}^X_{x,y} = \frac{1}{\max\big(4\,(d_0(x-1,y) - d_0(x,y))^2,\; 4\big)}, \tag{5}$$

$$\hat{b}^Y_{x,y} = \frac{1}{\max\big(4\,(d_0(x,y) - d_0(x,y-1))^2,\; 4\big)}, \tag{6}$$

where $d_0(x, y)$ is the disparity value of the initial estimate at pixel location (x, y). Note that $d_0$ is used in the above equations instead of $d$ since the true disparities are not available. In our case $d_0$ is the initial estimate of the disparity, which we assume to be close to the true $d$ and which is obtained by using the proposed learning method described in the next section. We estimate two parameters, $b^X$ and $b^Y$, at each pixel location. These parameters help to obtain a solution which is less noisy in smooth areas and preserves the depth discontinuities in other areas.
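To make equations (4) to (6) concrete, the following is a minimal sketch of how the per-pixel weights and the prior energy might be computed with NumPy. The function names, the `[y, x]` array layout (x taken as the column index), and the choice of zero differences at the first row and column are assumptions of this sketch, not details given in the paper.

```python
import numpy as np

def igmrf_params(d0):
    """Per-pixel weights (b^X, b^Y) from an initial disparity map, Eqs. (5)-(6)."""
    d0 = np.asarray(d0, dtype=np.float64)  # stored as d0[y, x]
    dx = np.zeros_like(d0)
    dy = np.zeros_like(d0)
    dx[:, 1:] = d0[:, :-1] - d0[:, 1:]   # d0(x-1, y) - d0(x, y)
    dy[1:, :] = d0[1:, :] - d0[:-1, :]   # d0(x, y) - d0(x, y-1)
    # Assumption: differences at the first column/row are taken as zero.
    bx = 1.0 / np.maximum(4.0 * dx ** 2, 4.0)
    by = 1.0 / np.maximum(4.0 * dy ** 2, 4.0)
    return bx, by

def prior_energy(d, bx, by):
    """IGMRF prior energy of Eq. (4) for a candidate disparity map d."""
    d = np.asarray(d, dtype=np.float64)
    dx = np.zeros_like(d)
    dy = np.zeros_like(d)
    dx[:, 1:] = d[:, :-1] - d[:, 1:]     # d(x-1, y) - d(x, y)
    dy[1:, :] = d[:-1, :] - d[1:, :]     # d(x, y-1) - d(x, y)
    return float(np.sum(bx * dx ** 2 + by * dy ** 2))

# A flat region on the left and a disparity jump of 3 in the middle.
d0 = np.array([[0, 0, 3, 3],
               [0, 0, 3, 3]])
bx, by = igmrf_params(d0)
print(bx[0])  # the weight drops from 1/4 to 1/36 at the jump
```

The example shows the adaptive behaviour described above: in flat regions the weight saturates at 1/4, while across a jump of 3 it falls to 1/36, so smoothing is penalized nine times less at the discontinuity.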

IV. LEARNING THE INITIAL ESTIMATE

Our first task is to estimate the IGMRF parameters at every location. However, this requires the availability of the true disparities, which themselves have to be estimated. In the absence of strong mathematical models and of tools to solve real life problems accurately, one has to look for practical solutions. Hence, we use an initial estimate of the disparities, obtained by a suitable approach, as a close approximation to the true disparities.

In recent years, there has been considerable progress in solving the stereo vision problem using machine learning methods due to the increasing availability of ground truth data. Motivated by the approaches of [7], [8], we propose a new learning based approach in order to obtain an initial disparity map. We use this initial estimate to obtain the IGMRF parameters at every location. The advantage of our approach lies in the reduction of overall computational complexity. This is because our learning of the initial estimate uses already computed rough estimates of disparities in order to pick the true disparity values from the training set. For learning the disparity we consider N sets of stereo images of various scenes, where each set has M rectified views. These stereo sets are publicly available on the Middlebury website [13]. We obtain the disparity map for each stereo set of M images using the standard multiple baseline stereo method [10]. A single level Gaussian pyramid decomposition is applied to these M views for each stereo set, and disparities are obtained on the decomposed views using the same approach. We use the pyramidal decomposition in order to better constrain the solution while learning. Thus our database consists of N disparity maps estimated on the original data consisting of N stereo sets, N disparity maps corresponding to the Gaussian filtered and downsampled versions, and N true disparity maps.
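A single level of Gaussian pyramid decomposition can be sketched as a low-pass filter followed by downsampling by two, so that each 2 × 2 block of the full-resolution grid corresponds to one sample at the coarse level. The paper does not state the filter it uses; the 5-tap binomial kernel, the edge padding, and the function name `gaussian_reduce` below are assumptions of this sketch.

```python
import numpy as np

def gaussian_reduce(img):
    """One pyramid level: separable 5-tap blur, then keep every second pixel."""
    # Assumption: binomial approximation to a Gaussian kernel.
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    # Replicate border pixels so the output keeps the input size before sampling.
    padded = np.pad(img, 2, mode="edge")
    # Filter along x (columns), then along y (rows).
    blurred_x = sum(k[i] * padded[:, i:i + w] for i in range(5))
    blurred = sum(k[i] * blurred_x[i:i + h, :] for i in range(5))
    return blurred[::2, ::2]

view = np.full((4, 6), 7.0)
small = gaussian_reduce(view)
print(small.shape)  # half the size in each dimension
```

Under this sketch each of the M views would be reduced once, and the same stereo method would then be run on the reduced views to obtain the coarse disparity maps stored in the database.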

Our learning method is as follows. Given a test stereo set with M rectified views, we first apply a one level Gaussian decomposition to these images. The same multiple baseline stereo approach is used to obtain the disparity maps for the test stereo set as well as for its downsampled version. We divide the test disparity map into small patches of size 2 × 2 and estimate the final disparities for each patch separately.

Fig. 1. Block diagram for obtaining the initial estimate of the disparity map. (Here we used M = 5 and N = 20 in our experiments.)

Similarly, all the disparity maps in the training set are divided into small patches of size 2 × 2. We start with the first patch of the test disparity map together with the corresponding single disparity value in its downsampled version, and compare these values with all the patches in the training disparity maps together with their corresponding single disparity values in their downsampled versions. The comparison is done using the sum of squared differences (SSD) measure. Suppose a patch of the kth disparity map in the training set gives the minimum SSD value with the test disparity patch. The location of that patch is noted, and the true disparity patch of the kth true disparity map in the training set is extracted from the noted location. These true disparities are the final learned disparities for that patch of the test stereo set. The same procedure is repeated for all the patches in the test disparity map. This gives us the initial estimate for the disparity map. Our proposed learning method for obtaining the initial estimate is illustrated by the block diagram shown in Fig. 1. The advantage of our learning method is that it is a simple approach and does not need any model or parameter learning from the database as done in [7], [8]. Disparities are estimated from the available data itself.
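The patch lookup described above can be sketched as a brute-force nearest-neighbour search. In this sketch each 2 × 2 patch and its single coarse value are stacked into a 5-element vector; that stacking, the function names, and the requirement that every map has even dimensions with a coarse map of exactly half the size are assumptions made for illustration.

```python
import numpy as np

def to_patches(disp, small_shape):
    """Flatten a disparity map into rows holding one 2x2 patch each."""
    h, w = small_shape
    disp = np.asarray(disp, dtype=np.float64)[:2 * h, :2 * w]
    return disp.reshape(h, 2, w, 2).transpose(0, 2, 1, 3).reshape(h * w, 4)

def patch_features(est, est_small):
    """Stack each estimated 2x2 patch with its coarse value (5 numbers)."""
    coarse = np.asarray(est_small, dtype=np.float64).reshape(-1, 1)
    return np.hstack([to_patches(est, est_small.shape), coarse])

def learn_initial_estimate(test_est, test_small, train_est, train_small, train_gt):
    """Copy, for every test patch, the true patch at the best SSD match."""
    feats = np.vstack([patch_features(e, s)
                       for e, s in zip(train_est, train_small)])
    truths = np.vstack([to_patches(g, s.shape)
                        for g, s in zip(train_gt, train_small)])
    query = patch_features(test_est, test_small)
    h, w = test_small.shape
    out = np.empty((h * w, 4))
    for i, q in enumerate(query):
        ssd = np.sum((feats - q) ** 2, axis=1)  # SSD against every training patch
        out[i] = truths[np.argmin(ssd)]         # true patch at the noted location
    # Reassemble the 2x2 patches into a full-resolution map.
    return out.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(2 * h, 2 * w)

train_est = [np.array([[1, 1, 5, 5], [1, 1, 5, 5]])]
train_small = [np.array([[1, 5]])]
train_gt = [np.array([[2, 2, 6, 6], [2, 2, 6, 6]])]
test_est = np.array([[5, 5, 1, 1], [5, 4, 1, 1]])
test_small = np.array([[5, 1]])
print(learn_initial_estimate(test_est, test_small, train_est, train_small, train_gt))
```

The exhaustive search keeps the sketch short; its cost grows with the number of test patches times the number of training patches, which is in line with the dependence of the running time on the image and database size noted in the experiments.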

V. FINAL DISPARITY USING REGULARIZATION FRAMEWORK

We compute the IGMRF parameters using equations (5) and (6), where $d_0$ corresponds to the initial learned estimate. Once the IGMRF parameters are known, maximum a posteriori (MAP) estimation is used to obtain the final disparity map. The IGMRF model on the disparity map serves as the prior for the MAP estimation, in which the prior parameters are already known. The data fitting term is given by equation (2). Finally, using equations (1), (2) and (4), the final disparity estimate is expressed as:

$$\hat{d} = \arg\min_{d}\, \big[E_{\mathrm{data}}(d) + E_{\mathrm{prior}}(d)\big]. \tag{7}$$

In order to carry out the minimization, we use a graph cuts based optimization technique [4], which converges quickly.
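For reference, substituting the data term (2) and the prior (4) into (7), with the weights fixed to the estimates from (5) and (6), gives the objective written out in full. This is only an expansion of the equations above, not an additional result:

$$
\hat{d} = \arg\min_{d} \sum_{(x,y)} \Big[ \big(I_L(x,y) - I_R(x + d(x,y),\, y)\big)^2 + \hat{b}^X_{x,y}\,\big(d(x-1,y) - d(x,y)\big)^2 + \hat{b}^Y_{x,y}\,\big(d(x,y-1) - d(x,y)\big)^2 \Big].
$$

Since the denominators in (5) and (6) are at least 4, every weight satisfies $0 < \hat{b} \le 1/4$, and the weights stay constant during the minimization because they depend only on $d_0$. The pairwise terms are weighted squared label differences, which form a semimetric but not a metric (for labels 0, 1, 2 we have $(0-2)^2 = 4 > 1 + 1$); this is the setting in which [4] applies swap moves rather than expansion moves.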

VI. EXPERIMENTAL RESULTS

For performance evaluation we train and test our method on Middlebury stereo images of the 2001, 2003 and 2005 datasets [13]. Our training database consists of N = 20 stereo datasets, and each dataset has M = 5 rectified views. The multiple baseline stereo algorithm is applied to each stereo dataset (with 5 views) contained in the database in order to generate the disparity maps. Another set of disparity maps is generated on the downsampled versions of the stereo sets in the database using the same approach. Hence our training set consists of 40 estimated and 20 ground truth disparity maps. To evaluate the performance of our algorithm, we use a leave-one-out approach to get the test dataset from the training set. Some of the test images with their ground truths are shown in Fig. 2(a, b). The initial estimate is learned from the training set of disparity maps using our new learning based approach discussed in Section IV. The IGMRF parameters estimated at every location are used while minimizing the energy function. Our energy function is minimized using the swap algorithm based on graph cuts [4].

The results of our approach on some of the test stereo sets are shown in Figure 2. Figure 2(c) shows the initial estimate generated from the learning method discussed in Section IV and Figure 2(d) shows the result of our proposed IGMRF based framework. Our proposed method performs well in the homogeneous and textured regions. It gives fairly accurate disparities in the background wall as well as in the homogeneous area covered by the sawtooth spikes of the “Sawtooth” image, as shown in Figure 2(d). Our method also preserves the smooth variation of disparities in the various planar regions of the “Venus” image and preserves the discontinuities. It can be clearly seen from Figure 2(d) that it localizes all the cones and the face of the statue well in the “Cones” image. In the middle part of the “Moebius” image, a showpiece, a table covered with a cloth, a small box and a photo frame are at different levels of depth. Our method preserves these depth discontinuities and the smoothness of disparities in each object very well. Similarly, in the “Reindeer” image it assigns fair disparities for the face while preserving the edges among the nose and ears. It also assigns almost the same disparities in the background and the table on which the head of the reindeer is placed, a textureless area. From the above discussion it is clear that the proposed method gives fair disparity estimates in homogeneous regions as well as in textured regions while preserving the sharp discontinuities at object boundaries. It preserves sharp discontinuities clearly due to the use of the IGMRF model.

Fig. 2. Results for (from top to bottom) the Sawtooth, Venus, Cones, Teddy, Reindeer and Moebius stereo sets: (a) left image, (b) ground truth disparity map, (c) initial estimate generated from the learning method discussed in Section IV, (d) final result using IGMRF regularization.

In order to compare our approach quantitatively, we use the percentage of bad matching pixels with a disparity error tolerance of 1, as reported in [2]. We use the two learning based stereo methods of [7] and [8] for comparison. We also compare our method with the approach in [6], one of the well known stereo algorithms based on regularization. In addition, we use two recent stereo matching algorithms based on nonlocal cost aggregation [14] and edge preservation [15] to compare with current research in stereo. The results in Table I show that our method achieves performance superior to that of [7] and [8], which use machine learning. It gives performance comparable to the methods of [6], [14] and [15]. Our proposed method is a combination of learning and a regularization framework. The running time of the proposed method is very low because of the use of a graph cuts based optimization technique. It is also computationally less taxing as compared to other learning based methods for stereo. Our learning method does not need any probabilistic model, and disparities are learned from the given disparity data itself. Calculation of the IGMRF parameters is very simple as compared to estimation of CRF parameters [7] and learning using SVM [8]. Our algorithm was tested on a computer with a Core(TM) 2 Duo CPU @ 1.40 GHz and 2.00 GB RAM. The computation time for obtaining the initial estimate was approximately 70 seconds, and the optimization using graph cuts took a few seconds. The computation time for obtaining the initial estimate depends strongly on the size of the test image, the closeness of the features/texture of the training images to those of the test image, and the number of images present in the training set. One may reduce the time complexity of obtaining the initial estimate by using training images belonging to the same class as the test image; this can be done by using a suitable image retrieval algorithm as a first step prior to building the database.
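The error measure used for the quantitative comparison, the percentage of bad matching pixels from [2], counts pixels whose absolute disparity error exceeds the tolerance. A minimal sketch follows; the function name and the optional `valid` mask argument (for example, a non-occluded region mask) are assumptions of this sketch.

```python
import numpy as np

def bad_pixel_percentage(disp, gt, valid=None, tol=1.0):
    """Percentage of pixels in `valid` with |disp - gt| > tol."""
    disp = np.asarray(disp, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        # Assumption: evaluate every pixel when no mask is supplied.
        valid = np.ones(gt.shape, dtype=bool)
    bad = np.abs(disp - gt) > tol
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)

disp = np.array([[10, 12], [7, 3]])
gt = np.array([[10, 10], [8, 3]])
print(bad_pixel_percentage(disp, gt))  # one of four pixels is off by more than 1
```

Passing a mask of non-occluded pixels as `valid` restricts the count to those regions.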

TABLE I. PERFORMANCE EVALUATION ON MIDDLEBURY STEREO DATASET [13]. COMPARISON IN TERMS OF ERROR RATES MEASURED AS THE PERCENTAGE OF BAD MATCHING PIXELS, CALCULATED IN NON-OCCLUDED REGIONS. HERE “-” INDICATES RESULTS NOT AVAILABLE.

| Method | Sawtooth | Venus | Cones | Teddy | Reindeer | Moebius |
|---|---|---|---|---|---|---|
| Learning+CRF [7] | - | 1.3 | 10.8 | 11.1 | 14 | 13 |
| Learning+SVM [8] | - | - | 3.77 | 6.47 | 11.72 | 10.8 |
| Parameter+MRF [6] | 1.33 | 0.97 | - | - | - | - |
| Nonlocal+aggregate [14] | - | 0.59 | 3.84 | 6.81 | - | - |
| Edge+Preserve [15] | - | 0.32 | 2.65 | 5.60 | - | - |
| Proposed | 1.35 | 0.99 | 3.46 | 4.50 | 5.47 | 3.14 |

VII. CONCLUSION

We have proposed a new learning based approach for dense disparity estimation using the IGMRF model as a prior. The IGMRF parameters are computed from the initial estimate, which is obtained from our proposed learning approach. The advantage of our method is that it learns the disparities from the data itself. It is computationally less taxing and does not require any model. Experimental results indicate that the proposed approach performs better in the homogeneous regions as well as at object boundaries when compared to other methods.

REFERENCES

[1] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721–741, Nov. 1984.

[2] D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1/2/3, pp. 7–42, April-June 2002.

[3] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, A. Agarwala, and C. Rother, “A comparative study of energy minimization methods for Markov random fields,” in ECCV, 2006, pp. 16–29.

[4] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, Nov. 2001.

[5] J. Sun, N. Zheng, and H. Shum, “Stereo matching using belief propagation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787–800, July 2003.

[6] L. Zhang and S. Seitz, “Parameter estimation for MRF stereo,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, June 2005, pp. 288–295.

[7] D. Scharstein and C. Pal, “Learning conditional random fields for stereo,” Computer Vision and Pattern Recognition, pp. 1–8, June 2007.

[8] Y. Li and D. Huttenlocher, “Learning for stereo vision using the structured support vector machine,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 2008, pp. 1–8.

[9] A. Jalobeanu, L. Blanc-Feraud, and J. Zerubia, “An adaptive Gaussian model for satellite image deblurring,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 613–621, April 2004.

[10] M. Okutomi and T. Kanade, “A multiple-baseline stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 353–363, 1993.

[11] M. Joshi and A. Jalobeanu, “MAP estimation for multiresolution fusion in remotely sensed images using an IGMRF prior model,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 48, no. 3, pp. 1245–1255, March 2010.

[12] P. Gajjar and M. Joshi, “New learning based super-resolution: Use of DWT and IGMRF prior,” Image Processing, IEEE Transactions on, vol. 19, no. 5, pp. 1201–1213, May 2010.

[13] D. Scharstein, R. Szeliski, and R. Zabih, “Stereo data sets,” http://vision.middlebury.edu/stereo/data.

[14] Q. Yang, “A non-local cost aggregation method for stereo matching,” in Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, 2012, pp. 1402–1409.

[15] C. Cigla and A. Alatan, “Efficient edge-preserving stereo matching,” in Computer Vision Workshops (ICCV Workshops), IEEE International Conference on, 2011, pp. 696–699.
