
Unsupervised Monocular Depth Estimation with Left-Right Consistency

Clément Godard    Oisin Mac Aodha    Gabriel J. Brostow
University College London

http://visual.cs.ucl.ac.uk/pubs/monoDepth/

Abstract

Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage.

We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.

1. Introduction

Depth estimation from images has a long history in computer vision. Fruitful approaches have relied on structure from motion, shape-from-X, binocular, and multi-view stereo. However, most of these techniques rely on the assumption that multiple observations of the scene of interest are available. These can come in the form of multiple viewpoints, or observations of the scene under different lighting conditions. To overcome this limitation, there has recently been a surge in the number of works that pose the task of monocular depth estimation as a supervised learning problem [31, 10, 35]. These methods attempt to directly predict the depth of each pixel in an image using models that have been trained offline on large collections of ground truth depth data. While these methods have enjoyed great success, to date they

Figure 1. Our depth prediction results on KITTI 2015. Top to bottom: input image, ground truth disparities, and our result. Our method is able to estimate depth for thin structures such as street signs and poles.

have been restricted to scenes where large image collections and their corresponding pixel depths are available.

Understanding the shape of a scene from a single image, independent of its appearance, is a fundamental problem in machine perception. There are many applications such as synthetic object insertion in computer graphics [28], synthetic depth of field in computational photography [2], grasping in robotics [33], using depth as a cue in human body pose estimation [46], robot assisted surgery [47], and automatic 2D to 3D conversion in film [51]. Accurate depth data from one or more cameras is also crucial for self-driving cars, where expensive laser-based systems are often used.

Humans perform well at monocular depth estimation by exploiting cues such as perspective, scaling relative to the known size of familiar objects, appearance in the form of lighting and shading, and occlusion [23]. This combination of both top-down and bottom-up cues appears to link full scene understanding with our ability to accurately estimate depth.

In this work, we take an alternative approach and treat automatic depth estimation as an image reconstruction problem during training. Our fully convolutional model does not require any depth data, and is instead trained to synthesize depth as an intermediate. It learns to predict the pixel-level correspondence between pairs of rectified stereo images that have a known camera baseline. There are some existing methods that also address the same problem, but with several limitations. For example they are not fully differentiable, making training suboptimal [16], or have image formation models that do


not scale to large output resolutions [51]. We improve upon these methods with a novel training objective and enhanced network architecture that significantly increases the quality of our final results. An example result from our algorithm is illustrated in Fig. 1. Our method is fast and only takes on the order of 50 milliseconds to predict a dense depth map for a 768×256 image on a modern GPU. Specifically, we propose the following contributions:
1) A network architecture that performs end-to-end unsupervised monocular depth estimation with a novel training loss that enforces left-right depth consistency inside the network.
2) An evaluation of several training losses and image formation models highlighting the effectiveness of our approach.
3) In addition to showing state of the art results on a challenging driving dataset, we also show that our model generalizes to three different datasets, including a new outdoor urban dataset that we have collected ourselves, which we make openly available.

2. Related Work

There is a large body of work that focuses on depth estimation from images, either using pairs [44], several overlapping images captured from different viewpoints [14], temporal sequences [42], or assuming a fixed camera, static scene, and changing lighting [50, 1]. These approaches are typically only applicable when there is more than one input image available of the scene of interest. Here we focus on works related to monocular depth estimation, where there is only a single input image, and no assumptions about the scene geometry or types of objects present are made.

Learning Based Stereo

The vast majority of stereo estimation algorithms have a data term which computes the similarity between each pixel in the first image and every other pixel in the second image. Typically the stereo pair is rectified and thus the problem of disparity (i.e. scaled inverse depth) estimation can be posed as a 1D search problem for each pixel. Recently, it has been shown that instead of using hand defined similarity measures, treating the matching as a supervised learning problem and training a function to predict the correspondences produces far superior results [52, 30]. It has also been shown that posing this binocular correspondence search as a multi-class classification problem has advantages both in terms of quality of results and speed [37]. Instead of just learning the matching function, Mayer et al. [38] introduced a fully convolutional [45] deep network called DispNet that directly computes the correspondence field between two images. At training time, they attempt to directly predict the disparity for each pixel by minimizing a regression training loss. DispNet has a similar architecture to their previous end-to-end deep optical flow network [12].

The above methods rely on having large amounts of accurate ground truth disparity data and stereo image pairs at training time. This type of data can be difficult to obtain for real world scenes, so these approaches typically use synthetic data for training. Synthetic data is becoming more realistic, e.g. [15], but still requires the manual creation of new content for every new application scenario.

Supervised Single Image Depth Estimation

Single-view, or monocular, depth estimation refers to the problem setup where only a single image is available at test time. Saxena et al. [43] proposed a patch-based model known as Make3D that first over-segments the input image into patches and then estimates the 3D location and orientation of local planes to explain each patch. The predictions of the plane parameters are made using a linear model trained offline on a dataset of laser scans, and the predictions are then combined together using an MRF. The disadvantage of this method, and other planar based approximations, e.g. [21], is that they can have difficulty modeling thin structures and, as predictions are made locally, lack the global context required to generate realistic outputs. Instead of hand-tuning the unary and pairwise terms, Liu et al. [35] use a convolutional neural network (CNN) to learn them. In another local approach, Ladicky et al. [31] incorporate semantics into their model to improve their per pixel depth estimation. Karsch et al. [27] attempt to produce more consistent image level predictions by copying whole depth images from a training set. A drawback of this approach is that it requires the entire training set to be available at test time.

Eigen et al. [10, 9] showed that it was possible to produce dense pixel depth estimates using a two scale deep network trained on images and their corresponding depth values. Unlike most other previous work in single image depth estimation, they do not rely on hand crafted features or an initial over-segmentation and instead learn a representation directly from the raw pixel values. Several works have built upon the success of this approach using techniques such as CRFs to improve accuracy [34], changing the loss from regression to classification [4], using other more robust loss functions [32], and incorporating strong scene priors in the case of the related problem of surface normal estimation [48]. Again, like the previous stereo methods, these approaches rely on having high quality, pixel aligned, ground truth depth at training time. We too perform single depth image estimation, but train with an added binocular color image, instead of requiring ground truth depth.

Unsupervised Depth Estimation

Recently, a small number of deep network based methods for novel view synthesis and depth estimation have been proposed, which do not require ground truth depth at training time. Flynn et al. [13] introduced a novel image synthesis network called DeepStereo that generates new views by selecting pixels from nearby images. During training, the relative pose of multiple cameras is used to predict the appearance of a held-out nearby image. Then the most appropriate depths are selected to sample colors from the neighboring images, based on plane sweep volumes.


At test time, image synthesis is performed on small overlapping patches. As it requires several nearby posed images at test time, DeepStereo is not suitable for monocular depth estimation.

The Deep3D network of Xie et al. [51] also addresses the problem of novel view synthesis, where their goal is to generate the corresponding right view from an input left image (i.e. the source image) in the context of binocular pairs. Again using an image reconstruction loss, their method produces a distribution over all the possible disparities for each pixel. The resulting synthesized right image pixel values are a combination of the pixels on the same scan line from the left image, weighted by the probability of each disparity. The disadvantage of their image formation model is that increasing the number of candidate disparity values greatly increases the memory consumption of the algorithm, making it difficult to scale their approach to bigger output resolutions. In this work, we perform a comparison to the Deep3D image formation model, and show that our algorithm produces superior results.

Closest to our model in spirit is the concurrent work of Garg et al. [16]. Like Deep3D and our method, they train a network for monocular depth estimation using an image reconstruction loss. However, their image formation model is not fully differentiable. To compensate, they perform a Taylor approximation to linearize their loss, resulting in an objective that is more challenging to optimize. Similar to other recent work, e.g. [41, 54, 55], our model overcomes this problem by using bilinear sampling [26] to generate images, resulting in a fully (sub-)differentiable training loss.

We propose a fully convolutional deep neural network loosely inspired by the supervised DispNet architecture of Mayer et al. [38]. By posing monocular depth estimation as an image reconstruction problem, we can solve for the disparity field without requiring ground truth depth. However, only minimizing a photometric loss can result in good quality image reconstructions but poor quality depth. Among other terms, our fully differentiable training loss includes a left-right consistency check to improve the quality of our synthesized depth images. This type of consistency check is commonly used as a post-processing step in many stereo methods, e.g. [52], but we incorporate it directly into our network.

3. Method

This section describes our single image depth prediction network. We introduce a novel depth estimation training loss, featuring an inbuilt left-right consistency check, which enables us to train on image pairs without requiring supervision in the form of ground truth depth.

3.1. Depth Estimation as Image Reconstruction

Given a single image I at test time, our goal is to learn a function f that can predict the per-pixel scene depth, $\hat{d} = f(I)$. Most existing learning based approaches treat this as a supervised learning problem, where they have color input

Figure 2. Our loss module outputs left and right disparity maps, $d^l$ and $d^r$. The loss combines smoothness, reconstruction, and left-right disparity consistency terms. This same module is repeated at each of the four different output scales. C: Convolution, UC: Up-Convolution, S: Bilinear Sampling, US: Up-Sampling, SC: Skip Connection.

images and their corresponding target depth values at training. It is presently not practical to acquire such ground truth depth data for a large variety of scenes. Even expensive hardware, such as laser scanners, can be imprecise in natural scenes featuring movement and reflections. As an alternative, we instead pose depth estimation as an image reconstruction problem during training. The intuition here is that, given a calibrated pair of binocular cameras, if we can learn a function that is able to reconstruct one image from the other, then we have learned something about the 3D shape of the scene that is being imaged.

Specifically, at training time, we have access to two images $I^l$ and $I^r$, corresponding to the left and right color images from a calibrated stereo pair, captured at the same moment in time. Instead of trying to directly predict the depth, we attempt to find the dense correspondence field $d^r$ that, when applied to the left image, would enable us to reconstruct the right image. We will refer to the reconstructed image $I^l(d^r)$ as $\tilde{I}^r$. Similarly, we can also estimate the left image given the right one, $\tilde{I}^l = I^r(d^l)$. Assuming that the images are rectified [19], $d$ corresponds to the image disparity - a scalar value per pixel that our model will learn to predict. Given the baseline distance $b$ between the cameras and the camera focal length $f$, we can then trivially recover the depth $\hat{d}$ from the predicted disparity, $\hat{d} = bf/d$.
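As a minimal numerical sketch of this conversion (not the authors' code), assuming a predicted disparity map in pixel units and roughly KITTI-like calibration values:

```python
import numpy as np

def disparity_to_depth(disparity_px, baseline_m, focal_px, eps=1e-6):
    """Recover metric depth from disparity via depth = b * f / d.
    `eps` guards against division by (near-)zero disparities."""
    return baseline_m * focal_px / np.maximum(disparity_px, eps)

# Example: with a ~0.54 m baseline and ~720 px focal length (roughly KITTI),
# a 30 px disparity corresponds to about 0.54 * 720 / 30 ≈ 13 m.
depth = disparity_to_depth(np.full((256, 768), 30.0), baseline_m=0.54, focal_px=720.0)
```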

3.2. Depth Estimation Network

At a high level, our network estimates depth by inferring the disparities that warp the left image to match the right one. The key insight of our method is that we can simultaneously infer both disparities (left-to-right and right-to-left), using only the left input image, and obtain better depths by enforcing them to be consistent with each other.

Our network generates the predicted image with backward mapping using a bilinear sampler, resulting in a fully differentiable image formation model. As illustrated in Fig. 3, naïvely learning to generate the right image by sampling from the left


Figure 3. Sampling strategies for backward mapping. With naïve sampling the CNN produces a disparity map aligned with the target instead of the input. No LR corrects for this, but suffers from artifacts. Our approach uses the left image to produce disparities for both images, improving quality by enforcing mutual consistency.

one will produce disparities aligned with the right image (target). However, we want the output disparity map to align with the input left image, meaning the network has to sample from the right image. We could instead train the network to generate the left view by sampling from the right image, thus creating a left view aligned disparity map (No LR in Fig. 3). While this alone works, the inferred disparities exhibit 'texture-copy' artifacts and errors at depth discontinuities as seen in Fig. 5. We solve this by training the network to predict the disparity maps for both views by sampling from the opposite input images. This still only requires a single left image as input to the convolutional layers and the right image is only used during training (Ours in Fig. 3). Enforcing consistency between both disparity maps using this novel left-right consistency cost leads to more accurate results.

Our fully convolutional architecture is inspired by DispNet [38], but features several important modifications that enable us to train without requiring ground truth depth. Our network, in Table 4, is composed of two main parts - an encoder (from cnv1 to cnv7b) and a decoder (from upcnv7). The decoder uses skip connections [45] from the encoder's activation blocks, enabling it to resolve higher resolution details. We output disparity predictions at four different scales (disp4 to disp1), which double in spatial resolution at each of the subsequent scales. Even though it only takes a single image as input, our network predicts two disparity maps at each output scale - left-to-right and right-to-left.

3.3. Training Loss

We define a loss $C_s$ at each output scale $s$, forming the total loss as the sum $C = \sum_{s=1}^{4} C_s$. Our loss module (Fig. 2) computes $C_s$ as a combination of three main terms,

$$C_s = \alpha_{ap} (C^l_{ap} + C^r_{ap}) + \alpha_{ds} (C^l_{ds} + C^r_{ds}) + \alpha_{lr} (C^l_{lr} + C^r_{lr}), \qquad (1)$$

where $C_{ap}$ encourages the reconstructed image to appear similar to the corresponding training input, $C_{ds}$ enforces smooth disparities, and $C_{lr}$ prefers the predicted left and right disparities to be consistent. Each of the main terms contains both a left and a right image variant, but only the left image is fed through the convolutional layers.

Next, we present each component of our loss in terms of the left image (e.g. $C^l_{ap}$). The right image versions, e.g. $C^r_{ap}$, are obtained by swapping left for right and sampling in the opposite direction.

Appearance Matching Loss

During training, the network learns to generate an image by sampling pixels from the opposite stereo image. Our image formation model uses the image sampler from the spatial transformer network (STN) [26] to sample the input image using a disparity map. The STN uses bilinear sampling where the output pixel is the weighted sum of four input pixels. In contrast to alternative approaches [16, 51], the bilinear sampler used is locally fully differentiable and integrates seamlessly into our fully convolutional architecture. This means that we do not require any simplification or approximation of our cost function.
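As a rough sketch of such a backward-mapping bilinear sampler (not the authors' Torch implementation), here realized with PyTorch's grid_sample and assuming the disparity is expressed as a signed fraction of the image width:

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(img, disp):
    """Backward-warp `img` horizontally by `disp` using bilinear sampling.

    img:  [B, C, H, W] source image (e.g. the opposite stereo view).
    disp: [B, 1, H, W] signed horizontal disparity, expressed as a fraction
          of the image width; its sign selects the warping direction.
    """
    b, _, h, w = img.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    xs = torch.linspace(-1, 1, w, device=img.device, dtype=img.dtype).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=img.device, dtype=img.dtype).view(1, h, 1).expand(b, h, w)
    # A disparity of fraction f of the width is a shift of 2f in grid units.
    grid = torch.stack((xs - 2.0 * disp.squeeze(1), ys), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```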

Inspired by [53], we use a combination of an L1 and single scale SSIM [49] term as our photometric image reconstruction cost $C_{ap}$, which compares the input image $I^l_{ij}$ and its reconstruction $\tilde{I}^l_{ij}$, where $N$ is the number of pixels,

$$C^l_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}(I^l_{ij}, \tilde{I}^l_{ij})}{2} + (1 - \alpha) \left\| I^l_{ij} - \tilde{I}^l_{ij} \right\|. \qquad (2)$$

Here, we use a simplified SSIM with a 3×3 block filter instead of a Gaussian, and set α = 0.85.
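A minimal sketch of this appearance term (not the authors' implementation; the SSIM stabilizing constants c1 and c2 are assumed to take their usual values for images in [0, 1]):

```python
import torch
import torch.nn.functional as F

def ssim_block(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM computed with a 3x3 block filter instead of a Gaussian."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def appearance_matching_loss(img, recon, alpha=0.85):
    """Eq. 2: weighted mix of (1 - SSIM)/2 and an L1 photometric term."""
    ssim_term = (1 - ssim_block(img, recon)) / 2
    l1_term = torch.abs(img - recon)
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()
```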

Disparity Smoothness Loss

We encourage disparities to be locally smooth with an L1 penalty on the disparity gradients $\partial d$. As depth discontinuities often occur at image gradients, similar to [20], we weight this cost with an edge-aware term using the image gradients $\partial I$,

$$C^l_{ds} = \frac{1}{N} \sum_{i,j} \left| \partial_x d^l_{ij} \right| e^{-\left\| \partial_x I^l_{ij} \right\|} + \left| \partial_y d^l_{ij} \right| e^{-\left\| \partial_y I^l_{ij} \right\|}. \qquad (3)$$
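A sketch of this edge-aware smoothness term (assuming, for illustration, that the image gradient norm is approximated by the mean absolute gradient over color channels):

```python
import torch

def disparity_smoothness_loss(disp, img):
    """Eq. 3: L1 disparity gradients, down-weighted across strong image edges.

    disp: [B, 1, H, W] predicted disparity.
    img:  [B, 3, H, W] corresponding color image.
    """
    d_dx = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    d_dy = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    i_dx = torch.mean(torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]), dim=1, keepdim=True)
    i_dy = torch.mean(torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]), dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```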

Left-Right Disparity Consistency Loss

To produce more accurate disparity maps, we train our network to predict both the left and right image disparities, while only being given the left view as input to the convolutional part of the network. To ensure coherence, we introduce an L1 left-right disparity consistency penalty as part of our model. This cost attempts to make the left-view disparity map be equal to the projected right-view disparity map,

$$C^l_{lr} = \frac{1}{N} \sum_{i,j} \left| d^l_{ij} - d^r_{ij + d^l_{ij}} \right|. \qquad (4)$$
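A sketch of this consistency term, reusing the hypothetical warp_with_disparity helper sketched above to sample $d^r$ at the locations given by $d^l$ (the sign convention must match the one used for image reconstruction):

```python
import torch

def lr_consistency_loss(disp_l, disp_r):
    """Eq. 4: L1 difference between the left disparity and the right disparity
    projected into the left view using the left disparity itself."""
    disp_r_to_l = warp_with_disparity(disp_r, disp_l)  # d^r sampled at (i, j + d^l_ij)
    return torch.abs(disp_l - disp_r_to_l).mean()
```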


Method | Dataset | Resolution | Abs Rel | Sq Rel | RMSE | RMSE log | D1-all | δ<1.25 | δ<1.25² | δ<1.25³
Ours with Deep3D [51] | K | 128×384 | 0.412 | 16.37 | 13.693 | 0.512 | 66.85 | 0.690 | 0.833 | 0.891
Ours with Deep3Ds [51] | K | 128×384 | 0.151 | 1.312 | 6.344 | 0.239 | 59.64 | 0.781 | 0.931 | 0.976
Ours without LR | K | 256×768 | 0.110 | 1.774 | 5.617 | 0.188 | 20.88 | 0.896 | 0.962 | 0.982
Ours | K | 256×768 | 0.098 | 1.218 | 5.265 | 0.175 | 19.49 | 0.899 | 0.964 | 0.984
Ours | CS | 256×512 | 0.651 | 9.128 | 13.770 | 0.516 | 93.90 | 0.064 | 0.431 | 0.890
Ours | CS + K | 256×512 | 0.097 | 1.265 | 5.243 | 0.175 | 18.55 | 0.904 | 0.966 | 0.985
Ours with post-processing | CS + K | 256×512 | 0.090 | 1.000 | 4.895 | 0.164 | 18.05 | 0.909 | 0.969 | 0.987
Ours Stereo | K | 256×768 | 0.118 | 3.155 | 5.746 | 0.197 | 10.58 | 0.931 | 0.967 | 0.980

Lower is better: Abs Rel, Sq Rel, RMSE, RMSE log, D1-all. Higher is better: δ<1.25, δ<1.25², δ<1.25³.

Table 1. Comparison of different image formation models. Results on the KITTI 2015 stereo 200 training set disparity images [17]. For training, K is the KITTI dataset [17] and CS is Cityscapes [8]. Our model with left-right consistency performs the best, and is further improved with the addition of the Cityscapes data. The last row shows the result of our model trained and tested with two input images instead of one (see Sec. 4.3).

Like all the other terms, this cost is mirrored for the right-view disparity map and is evaluated at all of the output scales.
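Putting the three terms together, a hypothetical sketch of the total loss of Eq. 1 summed over the four output scales (the per-scale dictionary layout and the helper functions are the illustrative ones sketched above; the weights follow the values reported in Sec. 4.1):

```python
def total_loss(scale_outputs, alpha_ap=1.0, alpha_lr=1.0, alpha_ds_base=0.1):
    """Sum the per-scale loss C_s of Eq. 1 over the four output scales.

    `scale_outputs` is assumed to hold, per scale: the left/right images, their
    reconstructions, both predicted disparities, and the downscaling factor r
    (used to weight the smoothness term as alpha_ds = alpha_ds_base / r).
    """
    loss = 0.0
    for out in scale_outputs:
        c_ap = (appearance_matching_loss(out["left"], out["left_recon"]) +
                appearance_matching_loss(out["right"], out["right_recon"]))
        c_ds = (disparity_smoothness_loss(out["disp_l"], out["left"]) +
                disparity_smoothness_loss(out["disp_r"], out["right"]))
        c_lr = (lr_consistency_loss(out["disp_l"], out["disp_r"]) +
                lr_consistency_loss(out["disp_r"], out["disp_l"]))
        loss = loss + alpha_ap * c_ap + (alpha_ds_base / out["r"]) * c_ds + alpha_lr * c_lr
    return loss
```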

At test time, our network predicts the disparity at the finest scale level for the left image $d^l$ (disp1 in Table 4), which has the same resolution as the input image. Using the known camera baseline and focal length from the training set, we then convert from the disparity map to a depth map. While we also estimate the right disparity $d^r$ during training, it is not used at test time.

4. Results

Here we compare the performance of our approach to both supervised and unsupervised single view depth estimation methods. We train on rectified stereo image pairs, and do not require any supervision in the form of ground truth depth. Existing single image datasets, such as [40, 43], that lack stereo pairs, are not suitable for evaluation. Instead we evaluate our approach using the popular KITTI 2015 [17] dataset. To evaluate our image formation model, we compare to a variant of our algorithm that uses the original Deep3D [51] image formation model and a modified one, Deep3Ds, with an added smoothness constraint. We also evaluate our approach with and without the left-right consistency constraint.

4.1. Implementation Details

Our network architecture is outlined in Table 4 and consists of 31 million parameters that we learn. The network is implemented in Torch [7] and takes on the order of 20 hours to train using 2 NVIDIA GeForce GTX TITAN X GPUs on a dataset of 30 thousand images for 50 epochs. Inference is fast and takes less than 50 ms (or more than 20 frames per second) for a 768×256 image, including transfer times to and from the GPU.

During optimization, we set the weighting of the different loss components to $\alpha_{ap} = 1$ and $\alpha_{lr} = 1$. The possible output disparities are constrained to be between 0 and $d_{\max}$ using a scaled sigmoid non-linearity, where $d_{\max} = 0.3 \times$ the image width at a given output scale. As a result of our multi-scale output, the typical disparity of neighboring pixels will differ by a factor of two between each scale (as we are upsampling the output by a factor of two). To correct for this, we scale the disparity smoothness term $\alpha_{ds}$ with $r$ for each scale to get equivalent smoothing at each level. Thus $\alpha_{ds} = 0.1/r$, where $r$ is the downscaling factor of the corresponding layer with respect to the resolution of the input image that is passed into the network (the in column of Table 4).
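A hypothetical sketch of such an output head, producing the two disparity channels at one scale and applying the scaled sigmoid constraint (not the authors' Torch code):

```python
import torch
import torch.nn as nn

class DisparityHead(nn.Module):
    """Per-scale prediction head: a 3x3 convolution with two output channels
    (left-to-right and right-to-left disparity), squashed into [0, d_max] by a
    scaled sigmoid, with disparity expressed as a fraction of the image width."""

    def __init__(self, in_channels, max_disp_frac=0.3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1)
        self.max_disp_frac = max_disp_frac

    def forward(self, x):
        return self.max_disp_frac * torch.sigmoid(self.conv(x))
```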

For the non-linearities in the network, we used exponential linear units [6] instead of the commonly used rectified linear units (ReLU) [39]. We found that ReLUs tended to prematurely fix the predicted disparities at intermediate scales to a single value, making subsequent improvement difficult. We trained our model from scratch for 50 epochs, with a batch size of 8 using Adam [29], where $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We used an initial learning rate of $\lambda = 10^{-4}$ which we kept constant for the first 30 epochs before halving it every 10 epochs until the end. We initially experimented with progressive update schedules, as in [38], where lower resolution image scales were optimized first. However, we found that optimizing all four scales at once led to more stable convergence. Similarly, we use an identical weighting for the loss of each scale as we found that weighting them differently led to unstable convergence. We experimented with batch normalization [25], but found that it did not produce a significant improvement, and ultimately excluded it.

Data augmentation is performed on the fly. We flip the input images horizontally with a 50% chance, taking care to also swap both images so they are in the correct position relative to each other. For color augmentations, we performed random gamma, brightness, and color shifts by sampling from uniform distributions in the ranges [0.8, 1.2] for gamma, [0.5, 2.0] for brightness, and [0.8, 1.2] for each color channel separately.
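For illustration, a minimal sketch of this on-the-fly augmentation for one stereo pair (not the authors' code; images are assumed to be float tensors in [0, 1] of shape [3, H, W]):

```python
import random
import torch

def augment_stereo_pair(left, right):
    """Random horizontal flip (swapping the two views) plus random gamma,
    brightness, and per-channel color shifts, as described above."""
    if random.random() < 0.5:
        # Flipping exchanges the roles of the two views, so swap them as well.
        left, right = torch.flip(right, dims=[-1]), torch.flip(left, dims=[-1])
    gamma = random.uniform(0.8, 1.2)
    brightness = random.uniform(0.5, 2.0)
    colors = torch.tensor([random.uniform(0.8, 1.2) for _ in range(3)],
                          device=left.device).view(3, 1, 1)
    jitter = lambda img: torch.clamp((img ** gamma) * brightness * colors, 0.0, 1.0)
    return jitter(left), jitter(right)
```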

Post-processing In order to reduce the effect of stereo disocclusions, which create disparity ramps on both the left side of the image and on the left of occluders, a final post-processing step is performed on the output. For an input image I at test time, we also compute the disparity map $d'_l$ for its horizontally flipped image I′. By flipping this disparity map back we obtain a disparity map $d''_l$, which aligns with $d_l$ but where the disparity ramps are located on the right of occluders as well as on the right side of the image. We combine both disparity maps to form the final result by assigning the first 5% on the left of the image using $d''_l$ and the last 5% on the right using the disparities from $d_l$. The central part of the final disparity map is the average of $d_l$ and $d''_l$. This final post-processing step leads to both better accuracy and less visual artifacts.
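A sketch of this blending step (an illustrative implementation, not necessarily the authors'; the 5% border regions are realized here with linear ramp masks):

```python
import numpy as np

def post_process_disparity(disp_l, disp_flipped_l):
    """Blend the disparity of the image and of its horizontally flipped version
    to suppress stereo disocclusion ramps.

    disp_l:         [H, W] disparity d_l predicted for the input image.
    disp_flipped_l: [H, W] disparity d'_l predicted for the flipped input.
    """
    h, w = disp_l.shape
    disp_back = np.fliplr(disp_flipped_l)        # d''_l, aligned with d_l
    mean_disp = 0.5 * (disp_l + disp_back)       # used in the central region
    xs = np.tile(np.linspace(0, 1, w), (h, 1))
    left_mask = 1.0 - np.clip(20 * (xs - 0.05), 0, 1)   # ~1 in the leftmost 5%
    right_mask = np.fliplr(left_mask)                   # ~1 in the rightmost 5%
    return (left_mask * disp_back + right_mask * disp_l
            + (1.0 - left_mask - right_mask) * mean_disp)
```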


Figure 4. Qualitative results on the KITTI Eigen split (left to right: input, ground truth, Eigen et al. [10], Liu et al. [35], Garg et al. [16], ours). As the ground truth velodyne depth is very sparse, we interpolate it for visualization purposes. Our method does better at resolving small objects such as the pedestrians and poles.

Method | Supervised | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Train set mean | No | K | 0.403 | 5.530 | 8.709 | 0.403 | 0.593 | 0.776 | 0.878
Eigen et al. [10] Coarse ◦ | Yes | K | 0.206 | 1.531 | 6.414 | 0.269 | 0.689 | 0.891 | 0.960
Eigen et al. [10] Fine ◦ | Yes | K | 0.197 | 1.485 | 6.179 | 0.260 | 0.712 | 0.895 | 0.961
Liu et al. [35] DCNF-FCSP FT * | Yes | K | 0.217 | 1.841 | 6.986 | 0.289 | 0.647 | 0.882 | 0.961
Ours without LR | No | K | 0.150 | 2.062 | 6.220 | 0.252 | 0.819 | 0.928 | 0.964
Ours | No | K | 0.141 | 1.369 | 5.849 | 0.242 | 0.818 | 0.929 | 0.966
Ours | No | CS + K | 0.136 | 1.512 | 5.763 | 0.236 | 0.836 | 0.935 | 0.968
Ours with post-processing | No | CS + K | 0.126 | 1.161 | 5.381 | 0.224 | 0.843 | 0.941 | 0.972
Garg et al. [16] L12 Aug 8× cap 50m | No | K | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962
Ours cap 50m | No | K | 0.123 | 0.944 | 5.061 | 0.221 | 0.843 | 0.942 | 0.972
Ours cap 50m | No | CS + K | 0.118 | 0.932 | 4.941 | 0.215 | 0.858 | 0.947 | 0.974
Ours cap 50m with post-processing | No | CS + K | 0.112 | 0.783 | 4.753 | 0.205 | 0.864 | 0.951 | 0.978

Lower is better: Abs Rel, Sq Rel, RMSE, RMSE log. Higher is better: δ<1.25, δ<1.25², δ<1.25³.

Table 2. Results on KITTI 2015 [17] using the split of Eigen et al. [10]. For training, K is the KITTI dataset [17] and CS is Cityscapes [8]. The predictions of Liu et al. [35]* are generated on a mix of the left and right images instead of just the left input images. For a fair comparison, we recompute their results using the correct image. As in the provided source code, Eigen et al. [10]◦ results are computed relative to the velodyne instead of the camera. Garg et al. [16] results are taken directly from their paper. We also show our results with the same crop and maximum evaluation distance.

4.2. KITTI

We present results for the KITTI dataset [17] using two different test splits, to enable comparison to existing works. In its raw form, the dataset contains 42,382 images, with a typical image being 1242×375 pixels in size. As no official test split exists for the raw data, and to ensure no overlap between the training set and the test images, we remove frames that are within ±5 frames of a given test image. As the car capturing the data is usually moving while capturing images at 10 frames per second, this greatly reduces the overlap between the two splits.

KITTI Split First we compare different variants of our method including different image formation models and different training sets. We evaluate on the 200 high quality disparity images provided as part of the official KITTI training set. While these disparity images are much better quality than the reprojected velodyne laser depth values, they have CAD models inserted in place of moving cars. These CAD models result in ambiguous disparity values on transparent surfaces such as car windows, and issues at object boundaries where the CAD models do not perfectly align with the images. Our evaluation is performed on the full input image resolution, and the total number of training images is 40,046. In addition, the maximum depth present in the KITTI dataset is on the order of 80 meters, and we cap the maximum predictions of all networks to this value. Results are computed using the depth metrics from [10] along with the D1-all disparity error from KITTI [17]. The metrics from [10] measure error in both meters from the ground truth and the percentage of depths that are within some threshold from the correct value. It is important to note that measuring the error in depth space while the ground truth is given in disparities leads to precision issues. In particular, the non-thresholded measures can be sensitive to the large errors in depth caused by prediction errors at small disparity values.

In Table 1, we see that in addition to having poor scaling properties (in terms of both resolution and the number of disparities it can represent), when trained from scratch with the same network architecture as ours, the Deep3D [51] image formation model performs poorly. From Fig. 6 we can see that Deep3D produces plausible image reconstructions but the output disparities are inferior to ours. Our loss outperforms both Deep3D baselines, and the addition of the left-right consistency check increases performance in all measures. In Fig. 5 we illustrate some zoomed in comparisons, clearly showing that the inclusion of the left-right

Figure 5. Comparison between our method with and without the left-right consistency. Our consistency term produces superior results on the object boundaries. Both results are shown without post-processing.


Figure 6. Image reconstruction error on KITTI (reconstruction error shown at ×2, alongside the corresponding disparities, for Deep3D, Deep3Ds, and ours). While all methods output plausible right views, the Deep3D image formation model without smoothness constraints does not produce valid disparities.

check improves the visual quality of the results. Our results are further improved by first pre-training our model with additional training data from the Cityscapes dataset [8] containing 22,973 training stereo pairs captured in various cities across Germany. This dataset brings higher resolution, image quality, and variety compared to KITTI, while having a similar setting. We cropped the input images to only keep the top 80% of the image, removing the very reflective car hoods from the input. Interestingly, our model trained on Cityscapes alone does not perform very well numerically. This is likely due to the difference in camera calibration between the two datasets, but there is a clear advantage to fine-tuning on data that is related to the test set.

Eigen Split To be able to compare to existing work, we also use the test split of 697 images as proposed by [10], which results in 33,131 images for training. To generate the ground truth depth images, we reproject the 3D points viewed from the velodyne laser into the left input color camera. Aside from only producing depth values for less than 5% of the pixels in the input image, errors are also introduced because of the rotation of the velodyne, the motion of the vehicle and surrounding objects, and also incorrect depth readings due to occlusion at object boundaries. To be fair to all methods, we use the same crop as [10] and evaluate at the input image resolution. With the exception of Garg et al.'s [16] results, the results of the baseline methods are recomputed by us given the authors' original predictions to ensure that all the scores are directly comparable. This produces slightly different numbers than the previously published ones, e.g. in the case of [10], their predictions were evaluated on much smaller depth images (1/4 the original size). For all baseline methods we use bilinear interpolation to resize the predictions to the correct input image size.

Table 2 shows quantitative results with some example outputs shown in Fig. 4. We see that our algorithm outperforms all other existing methods, including those that are trained with ground truth depth data. We again see that pre-training on the Cityscapes dataset improves the results over using KITTI alone.

Method | Sq Rel | Abs Rel | RMSE | RMSE log
Train set mean* | 13.98 | 0.876 | 12.27 | 0.307
Karsch et al. [27]* | 5.079 | 0.428 | 8.389 | 0.149
Liu et al. [36]* | 6.562 | 0.475 | 10.05 | 0.165
Laina et al. [32] berHu* | 1.840 | 0.204 | 5.683 | 0.084
Ours with Deep3D [51] | 17.18 | 1.000 | 19.11 | 2.527
Ours | 10.94 | 0.544 | 11.76 | 0.193
Ours with post-processing | 7.140 | 0.469 | 9.664 | 0.183

Table 3. Results on the Make3D dataset [43]. All methods marked with an * are supervised and use ground truth depth data from the Make3D training set. Using the standard C1 metric, errors are only computed where depth is less than 70 meters in a central image crop.

4.3. Stereo

We also implemented a stereo version of our architecture, where instead of only feeding the left image to the convolutional layers as input, we use both left and right views, resulting in a six channel input. While the depth errors in Table 1 are worse for some of the error measures than in the single image case, the stereo model performs significantly better on the D1-all disparity measure. The higher errors of the stereo model on the distance based metrics can be explained by its tendency to sometimes incorrectly estimate the depths of low texture regions such as roads, as can be seen in Fig. 8.

4.4. Make3D

To illustrate that our method can generalize to other datasets, here we compare to several fully supervised methods on the Make3D test set of [43]. Make3D consists of only RGB/Depth pairs and no stereo images, thus our method cannot train on this data. We use our network trained only on the Cityscapes dataset and despite the dissimilarities in the datasets, both in content and camera parameters, we still achieve reasonable results. Due to the different aspect ratio of the Make3D dataset we evaluate on a central crop of the images. In Table 3, we compare our output to the similarly cropped results of the other methods. As in the case of the KITTI dataset, these results would likely be improved with more relevant training data. A qualitative comparison to some of the related methods is shown in Fig. 7. While our numerical results are not as good as the baselines, qualitatively, we compare favorably to the supervised competition.

4.5. Generalizing to Other Datasets

Finally, we illustrate some further examples of our model generalizing to other datasets in Figure 9. Using the model only trained on Cityscapes [8], we tested on the CamVid driving dataset [3]. In the accompanying video and the supplementary material we can see that despite the differences in location, image characteristics, and camera calibration, our model still produces visually plausible depths. We also captured a 60,000 frame dataset, at 10 frames per second, taken in an urban environment with a wide angle consumer 1080p stereo camera. Finetuning the Cityscapes pre-trained model on this dataset


Figure 7. Our method achieves superior qualitative results on Make3D despite being trained on a different dataset (Cityscapes). Left to right: input, ground truth, Karsch et al. [27], Liu et al. [36], Laina et al. [32], ours.

Figure 8. Our stereo results (rows: input left image, stereo result, monocular result). Stereo disparity maps contain details such as the road barrier in the first image, but as seen in the last column, they sometimes exhibit artifacts in low textured regions such as the hole in the road.

produces visually convincing depth images for a test set that was captured with the same camera on a different day; please see the video in the supplementary material for more results.

4.6. Limitations

Even though both our left-right consistency check and post-processing improve the quality of the results, there are still some artifacts visible at occlusion boundaries due to the pixels in the occlusion region not being visible in both images. It may be possible to improve some of these issues by more explicitly reasoning about occlusion during training [22, 24]. It is worth noting that, depending on how large the baseline is between the camera and the depth sensor, fully supervised approaches also do not always have valid depth for all pixels.

While our method only needs a single image at test time to predict a scene's depth, it does require rectified and temporally

Figure 9. Qualitative results on Cityscapes, CamVid, and our own urban dataset captured on foot. For more results please see our video.

"Encoder"
layer | k | s | chns | in | out | input
cnv1 | 7 | 2 | 3/32 | 1 | 2 | left
cnv1b | 7 | 1 | 32/32 | 2 | 2 | cnv1
cnv2 | 5 | 2 | 32/64 | 2 | 4 | cnv1b
cnv2b | 5 | 1 | 64/64 | 4 | 4 | cnv2
cnv3 | 3 | 2 | 64/128 | 4 | 8 | cnv2b
cnv3b | 3 | 1 | 128/128 | 8 | 8 | cnv3
cnv4 | 3 | 2 | 128/256 | 8 | 16 | cnv3b
cnv4b | 3 | 1 | 256/256 | 16 | 16 | cnv4
cnv5 | 3 | 2 | 256/512 | 16 | 32 | cnv4b
cnv5b | 3 | 1 | 512/512 | 32 | 32 | cnv5
cnv6 | 3 | 2 | 512/512 | 32 | 64 | cnv5b
cnv6b | 3 | 1 | 512/512 | 64 | 64 | cnv6
cnv7 | 3 | 2 | 512/512 | 64 | 128 | cnv6b
cnv7b | 3 | 1 | 512/512 | 128 | 128 | cnv7

"Decoder"
layer | k | s | chns | in | out | input
upcnv7 | 3 | 2 | 512/512 | 128 | 64 | cnv7b
icnv7 | 3 | 1 | 1024/512 | 64 | 64 | upcnv7+cnv6b
upcnv6 | 3 | 2 | 512/512 | 64 | 32 | icnv7
icnv6 | 3 | 1 | 1024/512 | 32 | 32 | upcnv6+cnv5b
upcnv5 | 3 | 2 | 512/256 | 32 | 16 | icnv6
icnv5 | 3 | 1 | 512/256 | 16 | 16 | upcnv5+cnv4b
upcnv4 | 3 | 2 | 256/128 | 16 | 8 | icnv5
icnv4 | 3 | 1 | 128/128 | 8 | 8 | upcnv4+cnv3b
disp4 | 3 | 1 | 128/2 | 8 | 8 | icnv4
upcnv3 | 3 | 2 | 128/64 | 8 | 4 | icnv4
icnv3 | 3 | 1 | 130/64 | 4 | 4 | upcnv3+cnv2b+disp4*
disp3 | 3 | 1 | 64/2 | 4 | 4 | icnv3
upcnv2 | 3 | 2 | 64/32 | 4 | 2 | icnv3
icnv2 | 3 | 1 | 66/32 | 2 | 2 | upcnv2+cnv1b+disp3*
disp2 | 3 | 1 | 32/2 | 2 | 2 | icnv2
upcnv1 | 3 | 2 | 32/16 | 2 | 1 | icnv2
icnv1 | 3 | 1 | 18/16 | 1 | 1 | upcnv1+disp2*
disp1 | 3 | 1 | 16/2 | 1 | 1 | icnv1

Table 4. Our network architecture, where k is the kernel size, s the stride, chns the number of input and output channels for each layer, in and out are the downscaling factors for each layer relative to the input image, and input corresponds to the input of each layer, where + is a concatenation and * is a 2× upsampling of the layer.
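To make the decoder layout concrete, a hypothetical PyTorch-style sketch of one decoder stage (e.g. upcnv3, icnv3, disp3 from Table 4), reusing the illustrative DisparityHead sketched earlier and following the ELU activations and skip/disparity concatenation described in the text; this is a sketch, not the authors' Torch code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder stage following Table 4: up-convolve, concatenate the encoder
    skip connection and the 2x-upsampled coarser disparity, then fuse and predict
    the next disparity map (e.g. upcnv3 -> icnv3 -> disp3)."""

    def __init__(self, in_ch, out_ch, skip_ch, prev_disp_ch=2):
        super().__init__()
        self.upconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)
        self.iconv = nn.Conv2d(out_ch + skip_ch + prev_disp_ch, out_ch,
                               kernel_size=3, padding=1)
        self.disp_head = DisparityHead(out_ch)  # illustrative head sketched earlier

    def forward(self, x, skip, prev_disp):
        up = F.elu(self.upconv(x))
        prev_up = F.interpolate(prev_disp, scale_factor=2, mode="bilinear",
                                align_corners=True)
        fused = F.elu(self.iconv(torch.cat([up, skip, prev_up], dim=1)))
        return fused, self.disp_head(fused)

# For the upcnv3/icnv3/disp3 stage: in_ch=128, out_ch=64, skip_ch=64 (cnv2b),
# giving an iconv input of 64 + 64 + 2 = 130 channels, as in Table 4.
```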

aligned stereo pairs during training. This means that it is currently not possible to use existing single-view datasets for training purposes, e.g. [40]. However, it is possible to fine-tune our model on application specific ground truth depth data.

Finally, our method mainly relies on the image reconstruction term, meaning that specular [18] and transparent surfaces will produce inconsistent depths. This could be improved with more sophisticated similarity measures [52].

5. Conclusion

We have presented an unsupervised deep neural network for single image depth estimation. Instead of using aligned ground truth depth data, which is both rare and costly, we exploit the ease with which binocular stereo data can be captured. Our novel loss function enforces consistency between the predicted depth maps from each camera view during training, leading to improved predictions. Our results are superior to fully supervised baselines, which is encouraging for future research that does not require expensive-to-capture ground truth depth. We have also shown that our model can generalize to unseen datasets and still produce visually plausible depth maps.

In future work, we would like to extend our model to videos. Our current depth estimates are performed independently per frame; adding temporal consistency [27] would likely bring improved results. It would also be interesting to investigate sparse input as an alternative training signal [56, 5]. Finally, while our model estimates per pixel depth, for many applications it would be useful to also predict the full occupancy for both visible and occluded regions [11].


Acknowledgments We would like to thank David Eigen, Ravi Garg, Iro Laina and Fayao Liu for providing data and code to recreate the baseline algorithms. We also thank Stephan Garbin for his invaluable help throughout this project and Peter Hedman for his LaTeX magic. We are grateful for EPSRC funding for the EngD Centre EP/G037159/1, and for projects EP/K015664/1 and EP/K023578/1.

References

[1] A. Abrams, C. Hawley, and R. Pless. Heliometric stereo: Shape from sun position. In ECCV, 2012.
[2] J. T. Barron, A. Adams, Y. Shih, and C. Hernandez. Fast bilateral-space stereo for synthetic defocus. In CVPR, 2015.
[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2009.
[4] Y. Cao, Z. Wu, and C. Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. arXiv preprint arXiv:1605.02305, 2016.
[5] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NIPS, 2016.
[6] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[7] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[10] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[11] M. Firman, O. Mac Aodha, S. Julier, and G. J. Brostow. Structured prediction of unobserved voxels from a single depth image. In CVPR, 2016.
[12] P. Fischer, A. Dosovitskiy, E. Ilg, P. Hausser, C. Hazırbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[13] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. DeepStereo: Learning to predict new views from the world's imagery. In CVPR, 2016.
[14] Y. Furukawa and C. Hernandez. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2015.
[15] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[16] R. Garg, V. Kumar BG, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
[17] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[18] C. Godard, P. Hedman, W. Li, and G. J. Brostow. Multi-view reconstruction of highly specular surfaces in uncontrolled environments. In 3DV, 2015.
[19] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[20] P. Heise, S. Klose, B. Jensen, and A. Knoll. PM-Huber: PatchMatch with Huber regularization for stereo matching. In ICCV, 2013.
[21] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. TOG, 2005.
[22] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert. Recovering occlusion boundaries from a single image. In ICCV, 2007.
[23] I. P. Howard. Perceiving in Depth, Volume 1: Basic Mechanisms. Oxford University Press, 2012.
[24] A. Humayun, O. Mac Aodha, and G. J. Brostow. Learning to find occlusion regions. In CVPR, 2011.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[27] K. Karsch, C. Liu, and S. B. Kang. Depth transfer: Depth extraction from video using non-parametric sampling. PAMI, 2014.
[28] K. Karsch, K. Sunkavalli, S. Hadap, N. Carr, H. Jin, R. Fonte, M. Sittig, and D. Forsyth. Automatic scene inference for 3d object compositing. TOG, 2014.
[29] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[30] L. Ladicky, C. Hane, and M. Pollefeys. Learning the matching function. arXiv preprint arXiv:1502.00652, 2015.
[31] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In CVPR, 2014.
[32] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
[33] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 2015.
[34] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In CVPR, 2015.
[35] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. PAMI, 2015.
[36] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In CVPR, 2014.
[37] W. Luo, A. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.
[38] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[39] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[40] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[41] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309, 2015.
[42] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun. Dense monocular depth estimation in complex dynamic scenes. In CVPR, 2016.
[43] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 2009.
[44] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 2002.
[45] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 2016.
[46] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013.
[47] D. Stoyanov, M. V. Scarzanella, P. Pratt, and G.-Z. Yang. Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In MICCAI, 2010.
[48] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In CVPR, 2015.
[49] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. Transactions on Image Processing, 2004.
[50] R. J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 1980.
[51] J. Xie, R. Girshick, and A. Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.
[52] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. JMLR, 2016.
[53] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Is L2 a good loss function for neural networks for image processing? arXiv preprint arXiv:1511.08861, 2015.
[54] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, 2016.
[55] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016.
[56] D. Zoran, P. Isola, D. Krishnan, and W. T. Freeman. Learning ordinal relationships for mid-level vision. In ICCV, 2015.