
Efficient Max-Margin Learning in Laplacian MRFs for Monocular Depth Estimation

Dhruv Batra, TTI Chicago

[email protected]

Ashutosh Saxena, Cornell University

[email protected]

Abstract

While designing a Markov Random Field, especially one with continuous states such as in the task of depth estimation, one is presented with the modeling choice of what distribution to use. While distributions such as the Gaussian are easy to work with, because of tractable inference and learning, they often do not model the data well. In particular, the statistics of natural images are heavy-tailed, and for such heavy-tailed distributions, learning and inference can only be done approximately.

In this paper, we present learning and inference techniques for heavy-tailed Laplacian MRFs. We first show that exact inference is tractable and convex. Even though the energy terms are not linear in the parameters, we then develop max-margin methods for learning the parameters. Together with dual-decomposition techniques, our learning is scalable in the number of training images. Our extensive experiments show that our learning technique improves performance significantly over traditional approximate learning techniques, and we achieve state-of-the-art performance on a standard dataset.

1. Introduction

Markov Random Fields (MRFs) have been successfully applied to a number of computer vision problems, such as image segmentation, denoising and inpainting, stereo, optical flow, and single-image depth estimation. While designing an MRF, especially one with continuous states such as in the task of depth estimation, one is presented with several modeling choices; an important one is the choice of the prior distribution on the random variables. This may range from "simple" priors (e.g., Gaussian MRFs [24]) to highly non-convex priors in which learning and inference can only be done approximately. Gaussian priors seem attractive from a computational perspective, since Tappen et al. [24] showed that learning and inference in Gaussian MRFs boil down to linear algebra operations. However, a large body

Figure 1: Examples of cases where a Laplacian distribution could be better suited for modeling scene statistics. (Left) Log-histogram of relative depths collected from 400 laser scans by Saxena et al. [17, 18]; the relative depths are better modeled by a Laplacian distribution than by a Gaussian. For more details, see [9]. (Right) Mean spectral signatures of 6000 images of natural scenes from Torralba et al. [26]. (Images taken from [18] and [26] with permission.)

of work in natural image statistics [15, 26, 28] suggests that Gaussian priors are not well-suited for modeling natural image statistics, and typically tend to over-smooth the results for tasks like denoising, deblurring, and depth estimation. For example, Torralba and Oliva [26] studied the statistics of natural images for object categorization tasks; these statistics are highly "non-Gaussian," in that they have heavy tails. Similarly, in their work on monocular depth estimation, Saxena et al. [17, 18] studied log-histograms of relative depths collected from 400 laser scans and found that these relative depths are better modeled by a Laplacian distribution than by a Gaussian.

In this work, we focus on Laplacian MRFs (LMRFs), which use a Laplace distribution for modeling the statistics of random variables. These heavy-tailed distributions are often an ideal choice for vision problems. Moreover, inference in these models is convex and is in fact a linear program. This results in several nice properties, including that even with long-range interactions in the MRF, inference can be performed easily.
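As a quick numerical illustration of the heavy-tail property discussed above (this sketch is ours, not from the paper), the Laplace distribution places several times more probability mass beyond three standard deviations than a Gaussian of the same variance:

```python
import math

# Tail mass beyond 3 standard deviations for unit-variance distributions.
# Gaussian: P(|y| > 3*sigma) = erfc(3 / sqrt(2)).
# Laplace with scale b has variance 2*b^2, so unit variance means b = 1/sqrt(2),
# and P(|y| > t) = exp(-t/b), giving P(|y| > 3) = exp(-3*sqrt(2)).
gauss_tail = math.erfc(3 / math.sqrt(2))
laplace_tail = math.exp(-3 * math.sqrt(2))

print(f"Gaussian tail beyond 3 sigma: {gauss_tail:.5f}")   # ~ 0.00270
print(f"Laplace  tail beyond 3 sigma: {laplace_tail:.5f}")  # ~ 0.01437
```

The Laplace tail mass is roughly five times larger, which is the sense in which it is more tolerant of outliers than the Gaussian.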

Despite these nice properties, we believe LMRFs have not been exploited to their full potential in computer vision because the learning problem is intractable. More importantly, while discriminative structured learning approaches like Structured-SVMs [4, 27] are gaining popularity in vision, these standard approaches are not directly applicable to LMRFs because the log-conditional probability (or the energy function) is not linear in the parameters. We believe this is essentially the reason why LMRFs have not been more widely employed.

In this paper, we present learning techniques for this model in which we convert the learning problem into a max-margin problem by linearizing ℓ1-norm constraints. This enables the well-studied and powerful max-margin framework to be applied to learning Laplacian MRFs. Moreover, our proposed algorithm uses ideas from the dual-decomposition [1, 5] literature to decompose the problem of learning parameters from a dataset of images into smaller learning problems over individual training images. We present an efficient dual-decomposition-based framework that scales linearly with the number of training images and is very efficient in practice. This makes our approach highly parallelizable and scalable to a large number of training images.

We apply LMRFs to the problem of single-image depth estimation, which is a notoriously hard and mathematically ill-posed problem, because forming an image projects the 3D world onto a 2D image. Interestingly, while Saxena et al. [17, 19] originally proposed a Laplacian MRF to model depth as a function of image features, their learning algorithm was only approximate, in that it neglected the partition function (i.e., used pseudo-likelihood). In this work, we show that by using our learning method, we obtain significant improvements in the accuracy of depth estimates. Specifically, we achieve state-of-the-art performance on one common error metric and are competitive with state-of-the-art approaches on the other.

At a higher level, our work explores emerging links between learning approaches (such as structured SVMs) in the machine learning literature and efficient decomposition techniques (such as dual decomposition) in the optimization literature.

2. Related Work

Continuous Random Fields. Continuous random fields such as Gaussian random fields have a rich history in computer vision [23] and machine learning. It is hard to do justice to the literature in this area, and we refer the reader to [10] for more details. Parameter learning is intractable in most models, and typically approximate learning methods are proposed. One notable exception is the work of Tappen et al. [24], who showed that learning and inference in Gaussian Conditional Random Fields (CRFs) boil down to linear algebra operations and are thus tractable. Our goal is to develop tractable parameter learning and inference techniques for Laplacian Markov Random Fields.

Laplacian Priors. As we mentioned earlier, a large body

of work in natural image statistics [15, 26, 28] suggests that Gaussian priors are not well-suited for modeling natural image statistics. Krishnan and Fergus [11] also note the heavy-tailed distribution of gradients in natural images, and propose to use a hyper-Laplacian prior, $p(y) \propto \exp(-\lambda |y|^{\alpha})$ with $0.5 \le \alpha \le 0.8$. However, this problem is non-convex, and they use look-up tables for scalar $y$ (and analytical solutions for the specific values $\alpha = \frac{1}{2}, \frac{2}{3}$). While it is true that in some cases hyper-Laplacian priors are a better fit, Laplacian priors are often the closest one can get (especially as compared to a Gaussian) while staying tractable.

We note that Laplacian terms have been explored in several contexts, including Lasso shrinkage methods [6] and, more recently, sparse coding [29, 30]. These methods focus on inference techniques (e.g., an ℓ1-norm minimization can be written in the form of a linear program [2]). This paper, on the other hand, is concerned with the specific problem of parameter learning in MRFs that use these priors.

We should also point out that although the work of Zhu [31] uses a name similar to ours, they use Laplacian priors for sparse structural bias, while we use Laplacian priors on the variables of the model.

Max-Margin Learning. Taskar et al. [25] proposed a max-margin method for training Markov networks, and Tsochantaridis et al. [27] proposed a structured support-vector machine framework for learning structured models. Both techniques have been widely used since their introduction. However, in their formulations, the energy terms (i.e., the network potentials) are restricted to be linear in the parameters (e.g., log-linear models). This is not the case for LMRFs, since they contain ℓ1-norm terms.

Single Image Depth Estimation. Saxena et al. [17, 19] considered the problem of depth estimation from a single image using a Markov Random Field, and found that Laplacian potentials in the MRF significantly outperform Gaussian potentials, even with their approximate learning in which they ignored the partition function. Sudderth et al. [22] used hierarchical Dirichlet processes to model the depth of objects, but learning and inference in these models are only approximate. More recently, Liu et al. [14] proposed a semantic-category-based depth estimation model that is the current state-of-the-art on the dataset of Saxena et al. [17, 19].

3. Laplacian MRFs

Definition: We consider a Gibbs distribution over a set of random variables $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ (s.t. $y_u \in \mathbb{R}$), and a graph $G = (V, E)$ defined over these variables:

$$P(\mathbf{y} \mid \mathcal{X}, \theta) = \frac{1}{Z} \exp\big(-E(\mathbf{y} \mid \mathcal{X}, \theta)\big), \qquad (1)$$

where $\mathcal{X} = [\mathbf{x}_1^T; \mathbf{x}_2^T; \ldots; \mathbf{x}_n^T]$ is the matrix of feature vectors extracted at the labeling sites; $\theta$ is the parameter vector; and $Z$ is the partition function, which does not depend on $\mathbf{y}$ but does depend on $\theta$. The energy term for an LMRF is given by:

$$E(\mathbf{y} \mid \mathcal{X}, \theta) = \|\mathbf{y} - \mathcal{X}\theta\|_1 + \sum_{(u,v) \in E} \lambda(u,v)\, |y_u - y_v|, \qquad (2)$$

where $\|\mathbf{a}\|_1 = \sum_{i=1}^{N} |a_i|$ for $\mathbf{a} \in \mathbb{R}^N$. With a slight abuse of notation, we use $\mathbf{y} = (y_1, \ldots, y_n) \in \mathbb{R}^n$ to also denote a joint labeling for all sites. Here, the first term is the association potential, which models the label $y_u$ as a local dependency on the features $\mathcal{X}$. The second term is the interaction term, a data-dependent smoothing function ($\lambda(u,v)$ depends on the features at $u$ and $v$). $E$ is the edge set of the graph. Note that we place no restrictions on the size or type of the neighborhood, and there can be arbitrary "long-range" links between the variables. Also note that both terms in the model above, the local term as well as the interaction term, follow a Laplacian distribution.

For the depth labeling application, y denotes the depths at pixels in the image. As we mentioned, Laplacian distributions are interesting in that they have heavier tails. These heavier tails are more robust to outliers (unlike the Gaussian, which has negligible mass beyond 3σ). This has an important effect on the first term, which models the output y as a linear function of the image features. If we ignore the second term in our model, then the learning problem is equivalent to L1 regression [6], which is well known to handle outliers well. The Laplacian distribution on the second term has a similar outlier-robustness effect.
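To make the energy in Eq. (2) concrete, here is a small sketch (our illustration, not the authors' code; the feature matrix, parameters, and edge weights are made-up toy values) that evaluates the LMRF energy for a given labeling:

```python
import numpy as np

def lmrf_energy(y, X, theta, edges, lam):
    """LMRF energy of Eq. (2): ||y - X theta||_1 plus weighted pairwise terms.

    y     : (n,) labeling (e.g. depths)
    X     : (n, k) feature matrix, one row per site
    theta : (k,) parameter vector
    edges : list of (u, v) index pairs
    lam   : dict mapping (u, v) -> edge weight lambda(u, v)
    """
    association = np.abs(y - X @ theta).sum()
    interaction = sum(lam[(u, v)] * abs(y[u] - y[v]) for (u, v) in edges)
    return association + interaction

# Toy example: 3 sites on a chain, 2 features each (illustrative numbers).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([2.0, 3.0])
y = np.array([2.0, 3.0, 6.0])   # X @ theta = [2, 3, 5], so site 2 deviates by 1
edges = [(0, 1), (1, 2)]
lam = {(0, 1): 0.5, (1, 2): 0.5}

print(lmrf_energy(y, X, theta, edges, lam))  # 1.0 + 0.5*1 + 0.5*3 = 3.0
```

Both terms are sums of absolute values, which is what makes the MAP problem of the next section a linear program.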

4. Algorithm

We now describe our proposed approach. Before we go into details about parameter learning, we need to describe inference in LMRFs.

4.1. Inference

We focus on maximum a posteriori (MAP) inference in this model, which can be written as:

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathbb{R}^n} P(\mathbf{y} \mid \mathcal{X}, \theta) \qquad (3a)$$
$$= \arg\min_{\mathbf{y} \in \mathbb{R}^n} \log(Z) + E(\mathbf{y} \mid \mathcal{X}, \theta) \qquad (3b)$$
$$= \arg\min_{\mathbf{y} \in \mathbb{R}^n} \|\mathbf{y} - \mathcal{X}\theta\|_1 + \sum_{(u,v) \in E} \lambda(u,v)\, |y_u - y_v| \qquad (3c)$$

Note that the partition function Z is independent of y (but not of the parameters θ).

We now show how MAP inference in Laplacian MRFs can be written as an ℓ1-norm minimization problem. First define $\mathbf{d} \doteq \mathcal{X}\theta$, and let $Q$ be the (weighted) incidence matrix of the graph $G$, such that the rows of $Q\mathbf{y}$ give the differences of neighboring labels multiplied by $\lambda(u,v)$. Thus, the MAP problem becomes:

$$\mathbf{y}^* = \arg\min_{\mathbf{y} \in \mathbb{R}^n} \Big\{ \|\mathbf{y} - \mathbf{d}\|_1 + \|Q\mathbf{y}\|_1 \Big\} \qquad (4a)$$
$$= \arg\min_{\mathbf{y} \in \mathbb{R}^n} \|A\mathbf{y} - \mathbf{b}\|_1, \qquad (4b)$$

where $A = [I; Q]$ and $\mathbf{b} = [\mathbf{d}; \mathbf{0}]$. This can further be simplified to a linear program (LP) by introducing auxiliary variables $\mathbf{t} \in \mathbb{R}^{n + |E|}$ (one per row of $A$):

$$\mathbf{y}^* = \arg\min_{\mathbf{y}, \mathbf{t}} \; \mathbf{1}^T \mathbf{t} \qquad (5)$$
$$\text{s.t.} \quad \mathbf{t} \ge A\mathbf{y} - \mathbf{b} \qquad (6)$$
$$\qquad\; \mathbf{t} \ge -(A\mathbf{y} - \mathbf{b}) \qquad (7)$$

The trick above is to notice that an absolute-value minimization can be replaced by linear bounds from below. As we will see next, this trick helps us more than once.
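The LP in Eqs. (5)-(7) can be handed directly to an off-the-shelf solver. The sketch below is our own illustration (using `scipy.optimize.linprog`; the chain graph and numbers are made up): it stacks the variables as z = [y; t] and encodes the two bound constraints as A_ub z <= b_ub:

```python
import numpy as np
from scipy.optimize import linprog

def lmrf_map(d, edges, lam):
    """MAP inference of Eq. (4)/(5): min_y ||y - d||_1 + ||Q y||_1 as an LP.

    d     : (n,) unary targets (d = X theta)
    edges : list of (u, v) pairs
    lam   : list of edge weights lambda(u, v), aligned with `edges`
    """
    n, m = len(d), len(edges)
    # Weighted incidence matrix Q: row e is lam[e] * (e_u - e_v).
    Q = np.zeros((m, n))
    for e, (u, v) in enumerate(edges):
        Q[e, u], Q[e, v] = lam[e], -lam[e]
    A = np.vstack([np.eye(n), Q])          # A = [I; Q]
    b = np.concatenate([d, np.zeros(m)])   # b = [d; 0]
    r = n + m                              # number of auxiliary variables t

    # Variables z = [y; t]; objective 1^T t.
    c = np.concatenate([np.zeros(n), np.ones(r)])
    # t >= A y - b  and  t >= -(A y - b), rewritten as A_ub z <= b_ub.
    A_ub = np.block([[A, -np.eye(r)], [-A, -np.eye(r)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)] * r)
    return res.x[:n]

# Toy chain: with weak edge weights the fidelity term dominates, so y* = d.
d = np.array([1.0, 5.0, 2.0])
y_star = lmrf_map(d, edges=[(0, 1), (1, 2)], lam=[0.1, 0.1])
print(np.round(y_star, 4))  # [1. 5. 2.]
```

With larger edge weights the same LP smooths neighboring labels toward each other, which is the behavior the interaction term is designed to produce.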

4.2. Parameter Learning

Parameter learning involves learning the optimal values of the parameters θ from labeled training data. Following the work of Tsochantaridis et al. [27], we use a max-margin framework to learn these parameters. In a typical Structural-SVM (SSVM) [27] or Max-Margin Markov Network (M3-Net) [4], the energy function is linear in the parameters. Therefore, while such techniques work for many MRFs (such as ones having log-linear potentials), they do not apply to MRFs whose energy function is non-linear in θ. In our case, the energy function contains ℓ1-norm terms.

Let us first consider a single training sample. Intuitively, an SSVM maximizes the margin between the ground-truth energy and the energy of all other possible labelings, in the presence of an ℓ2-norm regularizer. Moreover, the SSVM generalizes the soft margin of SVMs by allowing slack variables $\varepsilon_i$. If $\mathbf{y}^{gt}$ are the ground-truth labels, then the learning problem can be written as:

$$(MM : \mathcal{I}) \quad \arg\min_{\theta,\, \varepsilon_i} \; \sum_{i \in \mathcal{I}} \varepsilon_i + \|\theta\|_2^2$$
$$\text{s.t.} \quad E(\mathbf{y}^{gt} \mid \mathcal{X}, \theta) \le E(\mathbf{y}^i \mid \mathcal{X}, \theta) + \varepsilon_i, \quad \varepsilon_i \ge 0 \quad \forall i \in \mathcal{I} \qquad (8)$$

In our case, the set of all possible "other" labelings I is uncountably infinite, and thus this program (MM : I) cannot even be written down. Following the work of Tsochantaridis et al. [27], we use a cutting-plane approach: we initialize this program with a small set of other labelings I, learn θ, and if the optimal labeling under this θ is not already in the set I, we add it to the set and repeat.

We now describe how to solve the program (MM : I):

$$(MM : \mathcal{I}) \quad \arg\min_{\theta,\, \varepsilon_i} \; \sum_{i \in \mathcal{I}} \varepsilon_i + \|\theta\|_2^2$$
$$\text{s.t.} \quad \|\mathbf{y}^{gt} - \mathcal{X}\theta\|_1 + \|Q\mathbf{y}^{gt}\|_1 \le \|\mathbf{y}^i - \mathcal{X}\theta\|_1 + \|Q\mathbf{y}^i\|_1 + \varepsilon_i,$$
$$\qquad \varepsilon_i \ge 0 \quad \forall i \in \mathcal{I} \qquad (9)$$

Note that the constraints are not linear in θ because of the ℓ1-norm terms. Furthermore, note that the variable θ multiplies X, and therefore every absolute-value term contains all the components of θ. This prevents the use of search algorithms such as the one in [12].

We now show how the above program (MM : I) can be converted into a QP using auxiliary variables $\mathbf{D}^{gt}, \mathbf{D}^i \in \mathbb{R}^n$. Intuitively, we use the same trick as in Eqn. (5) to convert an ℓ1-norm minimization into an LP:

$$(MMQP : \mathcal{I}) \quad \arg\min_{\theta,\, \varepsilon_i,\, \mathbf{D}^{gt},\, \mathbf{D}^i} \; \sum_{i \in \mathcal{I}} \varepsilon_i + \|\theta\|_2^2 + \sum_{j=1}^{n} \Big( D^{gt}_j + \sum_{i \in \mathcal{I}} D^i_j \Big)$$
$$\text{s.t.} \quad \sum_{j=1}^{n} \big( D^{gt}_j - D^i_j \big) \le \|Q\mathbf{y}^i\|_1 - \|Q\mathbf{y}^{gt}\|_1 + \varepsilon_i \quad \forall i \in \mathcal{I}$$
$$-\mathbf{D}^{gt} \le \mathbf{y}^{gt} - \mathcal{X}\theta \le \mathbf{D}^{gt} \qquad (10a)$$
$$-\mathbf{D}^{i} \le \mathbf{y}^{i} - \mathcal{X}\theta \le \mathbf{D}^{i} \qquad (10b)$$
$$\varepsilon_i \ge 0 \qquad (10c)$$

All constraints in the above program (MMQP : I) are linear in θ, ε_i, D^gt, D^i, and thus this program is a solvable quadratic program. We can therefore follow the same cutting-plane learning framework as Tsochantaridis et al. [27]: we initialize the program with a small set of other labelings I, learn θ, and if the optimal labeling under this θ is not already in the set I, we add it to the set and repeat.

It is important to point out the drawback of this approach: the program (MMQP : I) includes vector constraints (Eqns. 10a, 10b) of dimension equal to the number of random variables (n). While the constraint in Eqn. 10a does not grow with iterations, the constraints in Eqn. 10b do. Thus, with each iteration of the cutting-plane method, each additional "other" labeling added to the list I adds O(n) more constraints to the QP. Since we are working with images, n is typically very large, and the QP may become intractable. However, as we see next, we use ideas from the dual-decomposition [1, 5] literature to restrict this QP to a manageable size.

4.3. Extension to Multiple Training Images via Lagrangian Decomposition

Let us now extend this learning framework to learn from multiple images. Let the training dataset be indexed by $T = \{1, 2, \ldots, T\}$, and let the vectors $\varepsilon^{(t)} = \{\varepsilon^{(t)}_i : i \in \mathcal{I}\}$ and $\mathbf{D}^{(t)} = \{\mathbf{D}^{(t),gt}, \mathbf{D}^{(t),i} : i \in \mathcal{I}\}$ hold all slack variables and auxiliary variables for training image $t$. Also, for brevity, let us denote the linear constraints of Eqn. (10) by the polytope $\mathcal{P}^{(t)}$. We can now write a straightforward generalization of the above program (MMQP : I) to multiple training images:

$$(MMQP : \mathcal{I}_T) \quad \arg\min_{\theta,\, \varepsilon^{(t)},\, \mathbf{D}^{(t)}} \; \|\theta\|_2^2 + \sum_{t \in T} \Big( \varepsilon^{(t)} \cdot \mathbf{1} + \mathbf{D}^{(t)} \cdot \mathbf{1} \Big)$$
$$\text{s.t.} \quad \{\varepsilon^{(t)}, \mathbf{D}^{(t)}\} \in \mathcal{P}^{(t)} \qquad (11)$$

Clearly, as the size of the training dataset increases, this program becomes larger and very quickly impractical. We follow a decomposition approach, where we solve a relaxation of this problem that easily decomposes into smaller independent sub-problems, one per training image. This enables us to solve the problem on a distributed architecture or a cloud of machines, and thus scales well to large datasets. We describe this relaxation next. First, we reparameterize the above program by allocating to each training image its own copy of the parameters $\theta^{(t)}$:

$$(MMQP : \mathcal{I}_T^2) \quad \arg\min_{\theta,\, \theta^{(t)},\, \varepsilon^{(t)},\, \mathbf{D}^{(t)}} \; \sum_{t \in T} \Big( \varepsilon^{(t)} \cdot \mathbf{1} + \mathbf{D}^{(t)} \cdot \mathbf{1} + \|\theta^{(t)}\|_2^2 \Big)$$
$$\text{s.t.} \quad \{\varepsilon^{(t)}, \mathbf{D}^{(t)}\} \in \mathcal{P}^{(t)} \qquad (12a)$$
$$\theta^{(t)} = \theta \qquad (12b)$$

The above program (MMQP : I_T^2) uses a global variable θ to force all training images to have the same parameters, and is thus equivalent to the earlier program (MMQP : I_T). However, we can relax constraint Eqn. (12b) so that it need not hold at equality: we instead incorporate it into a penalty with dual variables $\lambda^{(t)}$. This allows us to break the problem into independent sub-problems:

$$(LR : \mathcal{I}_T) \quad \arg\min_{\theta,\, \theta^{(t)},\, \varepsilon^{(t)},\, \mathbf{D}^{(t)}} \; \sum_{t \in T} \Big( \varepsilon^{(t)} \cdot \mathbf{1} + \mathbf{D}^{(t)} \cdot \mathbf{1} + \|\theta^{(t)}\|_2^2 - \lambda^{(t)} \big(\theta^{(t)} - \theta\big) \Big)$$
$$\text{s.t.} \quad \{\varepsilon^{(t)}, \mathbf{D}^{(t)}\} \in \mathcal{P}^{(t)} \qquad (13)$$

Let us define the independent sub-problems as a function of the dual variables $\lambda^{(t)}$:

$$F(\lambda^{(t)}) = \min_{\theta^{(t)},\, \varepsilon^{(t)},\, \mathbf{D}^{(t)}} \; \varepsilon^{(t)} \cdot \mathbf{1} + \mathbf{D}^{(t)} \cdot \mathbf{1} + \|\theta^{(t)}\|_2^2 - \lambda^{(t)} \theta^{(t)} \qquad (14a)$$
$$\text{s.t.} \quad \{\varepsilon^{(t)}, \mathbf{D}^{(t)}\} \in \mathcal{P}^{(t)} \qquad (14b)$$

We can now search for the tightest relaxation by optimizing over the dual variables $\lambda^{(t)}$:

$$(LD : \mathcal{I}_T) \quad \max_{\lambda^{(t)}} \; \sum_{t \in T} F(\lambda^{(t)}) - \theta \cdot \Big( \sum_{t \in T} \lambda^{(t)} \Big) \qquad (15a)$$
$$\iff \quad \max_{\lambda^{(t)}} \; \sum_{t \in T} F(\lambda^{(t)}) \qquad (15b)$$
$$\text{s.t.} \quad \sum_{t \in T} \lambda^{(t)} = 0 \qquad (15c)$$

Here, the coefficient of θ must be 0, because otherwise the relaxed program (LR : I_T) would never have a finite value. The above problem is simply the Lagrangian dual of (MMQP : I_T). In a manner similar to Komodakis [?], we solve this dual problem via projected subgradient ascent. It is easy to verify that the subgradient of each sub-problem is simply the optimal parameter vector learned from that sub-problem. The projection step enforces the zero-mean constraint (15c), which can be done by simply subtracting the mean of the dual variables.

5. Single Image Depth Estimation

We apply our learning algorithm to the problem of estimating depth from a single image (see Fig. 2). This is an extremely challenging problem because an image is a projection of the 3D environment onto a 2D plane, which loses the depth information. Local features alone are often not good enough for estimating depth (e.g., both sky and a wall can be blue; a gray area could be a sidewalk or a wall), so the use of an MRF (with appropriate potentials) and good learning and inference techniques are important.

The problem is formulated as predicting the depth y ∈ ℝᴺ at every point in the image (for a total of N points), given image features x ∈ ℝᵏ at the corresponding points. The matrix X ∈ ℝ^(N×k) stacks the features for all points in the image.

The second term in the model, i.e., (y_u − y_v), has an additional known constant λ_uv that determines how strong the interaction between neighboring depths is. This is a spatially-varying term that indicates occlusion boundaries in the image; we estimate it from edge properties.

We computed features x using the publicly available code from Make3D [17]. In [17, 18], they perform learning

Figure 3: Error on the test set vs. the number of training images: (a) RMSE, (b) Avg-log10 error, (c) Avg-Rel error. (Red) Pseudo-likelihood approximate training. (Blue) Our training method. Note that our training method performs well even with a small number of training images.

by ignoring the partition function and simply minimizing the L1 error ||y − Xθ||₁ for their Laplacian model (and the L2 error ||y − Xθ||₂ for their Gaussian model). This method completely ignores the effect of the neighborhood term in the MRF, i.e., it treats every point as independent during training. In their method, the second term plays a role only during inference, and therefore the results are over-smoothed. More importantly, the estimated parameters cannot exploit the fact that the MRF potentials will be available during inference, and are therefore sub-optimal. (E.g., intuitively, if we know that the edge terms distinguish sky and building quite well, then the training method should choose parameters that do a better job on other parts of the scene.)

In [19, 20], they have an extended model with more parameters (in addition to θ). They use multi-conditional learning [?] for parameter estimation, where they learn θ followed by the other parameters and iterate. However, in the step for learning θ, they still use pseudo-log-likelihood, and therefore the parameters are significantly sub-optimal.

6. Experiments

We test our approach on the Make3D Range Image dataset [18, 19]. The dataset consists of a total of 534 images with ground-truth depths obtained from a laser scanner. The laser depths are sometimes noisy because of reflective/dark objects (scans are incorrect or missing), the range of the scanner (limited to 80m), and alignment problems (the images and depths are not perfectly aligned). The variety of environments in this dataset (such as roads, buildings, trees, and a few indoor corridor-type scenes) presents situations such as sharp depth changes (occlusions), thin long structures (trees, poles), etc.

We measure our performance using the commonly used error metric on this dataset called rel-error, defined as $|\hat{y} - y| / y$, where $y$ is the ground-truth depth and $\hat{y}$ the predicted depth. Another metric that is also sometimes used is the log-10 error, $|\log_{10} \hat{y}_i - \log_{10} y_i|$. However, it does not measure the scale of the depths correctly: if the predicted depths are off by a constant factor, $\hat{y} = \alpha y$, then the log-10 error is the same constant ($\log_{10} \alpha$) no matter how large $\alpha$ is. For the sake of completeness, we also report RMSE
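All three metrics are simple to compute from flattened depth maps; the helper below is our own sketch (the function name and the depth values are illustrative, not from the paper's code):

```python
import numpy as np

def depth_errors(pred, gt):
    """Make3D-style error metrics between predicted and ground-truth
    depths (flattened arrays, in meters)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    rel = np.mean(np.abs(pred - gt) / gt)                    # Avg-Rel error
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))   # Avg-log10 error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                # RMSE (linear)
    return rel, log10, rmse

# Illustrative values. Note that a constant scale factor pred = 2 * gt
# gives a fixed log-10 error of log10(2) ~ 0.301 regardless of the
# depths themselves, which is why log-10 is insensitive to scale errors.
gt = np.array([10.0, 20.0, 40.0])
rel, log10, rmse = depth_errors(2 * gt, gt)
print(round(rel, 3), round(log10, 3), round(rmse, 3))  # 1.0 0.301 26.458
```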

Figure 2: Results of using our learning algorithm on test scenes A through L. For each scene: (Left) Laser ground-truth depths. (Middle) Original image. (Right) Predicted depths. (Best viewed in color.)

errors.

We first evaluate the effect of the learning method on our dataset; we therefore use the same image features [19] and a similar MRF structure, and the only thing we change is the learning algorithm. Figure 4 shows the results using different learning algorithms. We see that our learning method performs significantly better than SCN, LGK, and Make3D. SCN [17] used pseudo-likelihood and had an error of 0.530. Make3D [19] reformulated the learning problem as multi-conditional learning to obtain an error of 0.458, while we obtain 0.378.

We then analyze the performance of our learning method as a function of training-set size. Figure 3 shows that our learning method is also less prone to over-fitting than pseudo-likelihood, i.e., even with a small amount of training data, it achieves reasonably good results on the test set.

We finally compare our numbers to the other state-of-the-art works on this dataset. The comparison is shown in Table 1. This table also lists the major "method" used to improve performance over the SCN work on single-image depth estimation [17].¹ Note that these methods use several other techniques (and sometimes additional data) in order to predict depths. For example, SCN [17] uses a hierarchical MRF; LGK [14] uses other information (semantic segmentation and additional MRF priors) to improve performance on two metrics. Heitz et al.'s Cascaded Classification Models combine information from object detection, image segmentation, and scene categorization, and obtain an error of 15.4m. All of these works focused on using "context" (i.e., other types of information) to improve performance. In contrast, our method uses only a flat 4-connected grid as the MRF and relies on our new learning algorithm to obtain 0.362 rel-error vs. 0.375 for the current state-of-the-art on this dataset. Our numbers could likely be improved even further by combining these ideas together: our learning method, the additional semantic modeling of LGK, and the multi-task information of Heitz et al.

In Figure 2, we present predicted depths for single test images, alongside the laser ground-truth depths for comparison. Our algorithm gives quite reasonable depths for most of the scenes. In general, MRFs suffer from the problem of over-smoothing (e.g., [17, 19]). However, this problem is less acute in our method; we believe this is because our learning method estimates the parameters while taking into account the edge terms in the MRF, and thus produces sharper (see Fig. 2-E,H,I) and more accurate depths. The problem still persists in some cases, such as Fig. 2-J, where our algorithm was confused by the texture of the leaves and produced only an over-smoothed result.

¹Cross-validation experiments (not reported here) showed similar results.

Figure 4: Effects of different kinds of learning on a pointwise MRF with Make3D features, plotting Avg-Rel error for Chance, SCN, Make3D, LGK, and LMRF (our model). This figure shows that our learning improves performance significantly over approximate training methods.

Note that the ground-truth labels were limited to a range of 80 meters; therefore, in most of the images in Figure 2, far-away structures are measured as 80m (same as sky) by the laser. The MRF edge terms provide an image-dependent prior of sorts that enables extrapolation using the image information. Our learned parameters are able to take advantage of this effect, and we often see that even far-away parts of the image are predicted reasonably (while the actual ground-truth label is wrong!); see Fig. 2-A,D,F,G,H,I,K,L. However, this sometimes also has unwanted side-effects: for example, in Fig. 2-F,I,J, the edge terms λ were not good enough, resulting in over-smoothed and incorrect depths for the leaves.

In Fig. 2-B, we see the reflections of another building, trees, and sky in a glass-paned transparent wall. The laser sends out an active light, measures the time-of-flight, and passes through glass (it does not see the reflections). Our algorithm, on the other hand, relies on the image to estimate depth, and we see that it estimates the depth of the reflected structures instead.

7. Conclusions

In this paper, we considered continuous-valued Laplacian MRFs, which have heavy tails. Even though inference in this model is tractable and convex, learning was done only approximately in prior work. Although the energy term in this model is not linear in the parameters, we developed max-margin learning methods for it. We also presented dual-decomposition techniques to make learning scalable in the number of training images.

Our learning method takes into account the MRF edge terms while estimating the parameters, and therefore utilizes the full potential of Laplacian MRFs. In our experiments, our method obtains sharper depth maps and also improves significantly over traditional approximate learning techniques.

Future work involves exploring more sophisticated decomposition methods, such as Augmented Lagrangian methods. In addition, ℓ1-norm minimization problems are often used

Table 1: Summary of results for the depth estimation task. Empty entries mean that those numbers were not reported in the prior work. Note that different methods use different features, different MRF structures, and in some cases other additional information.

Method | Description (main improvement source) | RMSE-linear | Avg-log10 | Avg-Rel
Chance | predict mean depthmap | 28 | 0.334 | 0.698
SCN [17] | hierarchical, pointwise MRF, Laplacian potentials | 16.7 | 0.198 | 0.530
HEH [8] | surface layout, discrete MRF | - | 0.320 | 1.423
Make3D - pointwise MRF [19] | tertiary connections | - | 0.149 | 0.458
Make3D - superpixel MRF [20] | superpixel formulation | - | 0.187 | 0.370
CCM [7] - cascaded models | object detection, segmentation, categorization | 15.4 | - | -
LGK - pointwise MRF [14] | semantic segmentation, geometry | - | 0.149 | 0.375
LGK - superpixel MRF [14] | semantic segmentation, geometry | - | 0.148 | 0.379
Our model - pointwise MRF | learning method | 15.8 | 0.168 | 0.362

as a (successful) heuristic for solving L0 (called cardinalityconstraint problems). We believe that the learning and in-ference techniques for LMRF presented in this paper couldalso be useful for MRF with other potentials (e.g., with car-dinality constraints) as well.
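To make the ℓ1-as-ℓ0-surrogate remark concrete, a small illustrative sketch (not from the paper): the elementwise ℓ1-penalized problem has a closed-form soft-thresholding solution that sets small coefficients exactly to zero, whereas quadratic (ℓ2) shrinkage never does. This exact-sparsity property is what makes ℓ1 a useful convex stand-in for the cardinality penalty:

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form minimizer of 0.5*(x - y)**2 + lam*|x|, elementwise.

    The l1 penalty zeroes out small entries exactly, mimicking the
    sparsity that an l0 (cardinality) penalty would enforce.
    """
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.2, -1.5, 0.05, 0.0])
x_l1 = soft_threshold(y, lam=0.5)  # -> [2.5, 0., -1., 0., 0.]: exact zeros
x_l2 = y / (1.0 + 0.5)             # ridge shrinkage: small entries stay nonzero
```

The same contrast carries over to MRF potentials: an ℓ1 edge term can leave many pairwise differences exactly at zero (piecewise-constant depth), which a quadratic edge term cannot.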

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006.
[4] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[5] M. Guignard. Lagrangean relaxation. TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, 11(2):151–200, 2003.
[6] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2005.
[7] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.
[8] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. IJCV, 2007.
[9] J. Huang, A. Lee, and D. Mumford. Statistics of range images. In CVPR, volume 1, 2000.
[10] D. Koller and N. Friedman. Probabilistic Graphical Models. The MIT Press, 2010.
[11] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS, 2009.
[12] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2006.
[13] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In CVPR, 2010.
[14] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In CVPR, 2010.
[15] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[16] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS 18, 2005.
[17] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS 18, 2005.
[18] A. Saxena, S. H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. IJCV, 2007.
[19] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 824–840, 2008.
[20] A. Saxena, M. Sun, and A. Y. Ng. Learning 3-D scene structure from a single still image. In ICCV Workshop on 3D Representation for Recognition (3dRR-07), 2007.
[21] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:824–840, 2009.
[22] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Depth from familiar objects: A hierarchical model for 3D scenes. In CVPR, 2006.
[23] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. IJCV, 5(3):271–302, 1990.
[24] M. F. Tappen, C. Liu, E. H. Adelson, and W. T. Freeman. Learning Gaussian conditional random fields for low-level vision. In CVPR, 2007.
[25] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2004.
[26] A. Torralba and A. Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 14:391–412, 2003.
[27] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.
[28] Y. Weiss and W. T. Freeman. What makes a good model of natural images? In CVPR, 2007.
[29] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[30] J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In CVPR, 2010.
[31] J. Zhu, E. P. Xing, and B. Zhang. Laplace maximum margin Markov networks. In ICML, 2008.