Semi-Supervised Deep Learning for Monocular Depth Map Prediction

Yevhen Kuznietsov, Jörg Stückler, Bastian Leibe
Computer Vision Group, Visual Computing Institute, RWTH Aachen University

52074 Aachen, Germany
[email protected], {stueckler | leibe}@vision.rwth-aachen.de

Abstract

Supervised deep learning often suffers from the lack of sufficient training data. Specifically in the context of monocular depth map prediction, it is barely possible to determine dense ground truth depth images in realistic dynamic outdoor environments. When using LiDAR sensors, for instance, noise is present in the distance measurements, the calibration between sensors cannot be perfect, and the measurements are typically much sparser than the camera images. In this paper, we propose a novel approach to depth map prediction from monocular images that learns in a semi-supervised way. While we use sparse ground-truth depth for supervised learning, we also enforce our deep network to produce photoconsistent dense depth maps in a stereo setup using a direct image alignment loss. In experiments we demonstrate superior performance in depth map prediction from single images compared to the state-of-the-art methods.

1. Introduction

Estimating depth from single images is an ill-posed problem which cannot be solved directly from bottom-up geometric cues in general. Instead, a-priori knowledge about the typical appearance, layout and size of objects needs to be used, or further cues such as shape from shading or focus have to be employed which are difficult to model in realistic settings. In recent years, supervised deep learning approaches have demonstrated promising results for single image depth prediction. These learning approaches appear to capture the statistical relationship between appearance and distance to objects well.

Supervised deep learning, however, requires vast amounts of training data in order to achieve high accuracy and to generalize well to novel scenes. Supplementary depth sensors are typically used to capture ground truth. In the indoor setting, active RGB-D cameras can be used. Outdoors, 3D laser scanners are a popular choice to capture depth measurements. However, using such sensing devices

Figure 1. We concurrently train a CNN from unsupervised and supervised depth cues to achieve state-of-the-art performance in single image depth prediction. For supervised training we use (sparse) ground-truth depth readings from a supplementary sensing cue such as a 3D laser. Unsupervised direct image alignment complements the ground-truth measurements with a training signal that is purely based on the stereo images and the predicted depth map for an image.

bears several shortcomings. Firstly, the sensors have their own error and noise characteristics, which will be learned by the network. In addition, when using 3D lasers, the measurements are typically much sparser than the images and do not capture high detail depth variations visible in the images well. Finally, accurate extrinsic and intrinsic calibration of the sensors is required. Ground truth data could alternatively be generated through synthetic rendering of depth maps. The rendered images, however, do not fully realistically display the scene and do not incorporate real image noise characteristics.

Very recently, unsupervised methods have been introduced [6, 9] that learn to predict depth maps directly from the intensity images in a stereo setup, without the need for an additional supplementary modality for capturing the ground truth. One drawback of these approaches is the
well-known fact that stereo depth reconstruction based on image matching is an ill-posed problem on its own. To this end, common regularization schemes can be used which impose priors on the depth, such as small depth gradient norms, which may not be fully satisfied in the real environment.

In this paper, we propose a semi-supervised learning approach that makes use of supervised as well as unsupervised training cues to incorporate the best of both worlds. Our method benefits from ground-truth measurements as an unambiguous (but noisy and sparse) cue for the actual depth in the scene. Unsupervised image alignment complements the ground-truth by a huge amount of additional training data which is much simpler to obtain and counteracts the deficiencies of the ground-truth depth measurements. By the combination of both methods, we achieve significant improvements over the state-of-the-art in single image depth map prediction, which we evaluate on the popular KITTI dataset [7] in urban street scenes. We base our approach on a state-of-the-art deep residual network in an encoder-decoder architecture for this task [16] and augment it with long skip connections between corresponding layers in encoder and decoder to predict high detail output depth maps. Our network converges quickly to a good model from little supervised training data, mainly due to the use of pretrained encoder weights (on the ImageNet [22] classification task) and unsupervised training. The use of supervised training also simplifies unsupervised learning significantly. For instance, a tedious coarse-to-fine image alignment loss as in previous unsupervised learning approaches [6] is not required in our semi-supervised approach.

In summary, we make the following contributions: 1) We propose a novel semi-supervised deep learning approach to single image depth map prediction that uses supervised as well as unsupervised learning cues. 2) Our deep learning approach demonstrates state-of-the-art performance in challenging outdoor scenes on the KITTI benchmark.

2. Related Work

Over the last years, several learning-based approaches to single image depth reconstruction have been proposed that are trained in a supervised way. Often, measured depth from RGB-D cameras or 3D laser scanners is used as ground-truth for training. Saxena et al. [24] proposed one of the first supervised learning-based approaches to single image depth map prediction. They model depth prediction in a Markov random field and use multi-scale texture features that have been hand-crafted. The method also combines monocular cues with stereo correspondences within the MRF.

Many recent approaches learn image features using deep learning techniques. Eigen et al. [5] propose a CNN architecture that integrates coarse-scale depth prediction with fine-scale prediction. The approach of Li et al. [17] combines deep learning features on image patches with hierarchical CRFs defined on a superpixel segmentation of the image. They use pretrained AlexNet [14] features of image patches to predict depth at the center of the superpixels. A hierarchical CRF refines the depth across individual pixels. Liu et al. [20] also propose a deep structured learning approach that avoids hand-crafted features. Their deep convolutional neural fields allow for training CNN features of unary and pairwise potentials end-to-end, exploiting continuous depth and Gaussian assumptions on the pairwise potentials. Very recently, Laina et al. [16] proposed to use a ResNet-based encoder-decoder architecture to produce dense depth maps. They demonstrate the approach to predict depth maps in indoor scenes using RGB-D images for training. Further lines of research in supervised training of depth map prediction use the idea of depth transfer from example images [13, 12, 21], or integrate depth map prediction with semantic segmentation [15, 19, 4, 26, 18].

Only few very recent methods attempt to learn depth map prediction in an unsupervised way. Garg et al. [6] propose an encoder-decoder architecture similar to FlowNet [3] which is trained to predict single image depth maps on an image alignment loss. The method only requires images of a corresponding camera in a stereo setup. The loss quantifies the photometric error of the input image warped into its corresponding stereo image using the predicted depth. The loss is linearized using a first-order Taylor approximation and hence requires coarse-to-fine training. Xie et al. [27] do not regress the depth maps directly, but produce probability maps for different disparity levels. A selection layer then reconstructs the right image using the left image and these probability maps. The network is trained to minimize the pixel-wise reconstruction error. Godard et al. [9] also use an image alignment loss in a convolutional encoder-decoder architecture but additionally enforce left-right consistency of the predicted disparities in the stereo pair. Our semi-supervised approach simplifies the use of unsupervised cues and does not require multi-scale depth map prediction in our network architecture. We also do not explicitly enforce left-right consistency, but use both images in the stereo pair equivalently to define our loss function. The semi-supervised method of Chen et al. [1] incorporates the side-task of depth ranking of pairs of pixels for training a CNN on single image depth prediction. For the ranking task, ground-truth is much easier to obtain but only indirectly provides information on continuous depth values. Our approach uses image alignment as a geometric cue which does not require manual annotations.

3. Approach

Figure 2. Components and inputs of our novel semi-supervised loss function.

We base our approach on supervised as well as unsupervised principles for learning single image depth map prediction (see Fig. 1). A straight-forward approach is to use a supplementary measuring device such as a 3D laser in order to capture ground-truth depth readings for supervised training. This process typically requires an accurate extrinsic calibration between the 3D laser sensor and the camera. Furthermore, the laser measurements have several shortcomings. Firstly, they are affected by erroneous readings and noise. They are also typically much sparser than the camera images when projected into the image. Finally, the centers of projection of laser and camera do not coincide. This causes depth readings of objects that are occluded from the view point of the camera to project into the camera image. To counteract these drawbacks, we make use of two-view geometry principles to learn depth prediction directly from the stereo camera images in an unsupervised way. We achieve this by direct image alignment of one stereo image to the other. This process only requires a known camera calibration and the depth map predicted by the CNN. Our semi-supervised approach learns from supervised and unsupervised cues concurrently.

We train the CNN to predict the inverse depth ρ(x) at each pixel x ∈ Ω from the RGB image I. According to the ground truth, the predicted inverse depth should correspond to the LiDAR depth measurement Z(x) that projects to the same pixel, i.e.

$\rho(x)^{-1} \overset{!}{=} Z(x)$.  (1)

However, the laser measurements only project to a sparse subset Ω_Z ⊆ Ω of the pixels in the image.

As the unsupervised training signal, we assume photoconsistency between the left and right stereo images, i.e.,

$I_1(x) \overset{!}{=} I_2(\omega(x, \rho(x)))$.  (2)

In our calibrated stereo setup, the warping function can be defined as

$\omega(x, \rho(x)) := x - f\, b\, \rho(x)$  (3)

on the rectified images, where f is the focal length and b is the baseline. This image alignment constraint holds at every pixel in the image.
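
To make the warping concrete, the following minimal NumPy sketch reconstructs the left image by sampling the right image at the positions given by Eq. (3). It is an illustrative sketch only; the function and variable names (focal_px, baseline_m, etc.) are ours and not from the paper, and it assumes rectified single-channel images.

```python
import numpy as np

def warp_right_to_left(right, inv_depth_left, focal_px, baseline_m):
    """Reconstruct the left view by sampling the right image at
    x' = x - f * b * rho(x) on rectified images (Eq. 3), using linear
    interpolation along the horizontal axis."""
    h, w = inv_depth_left.shape
    xs = np.tile(np.arange(w, dtype=np.float32), (h, 1))
    x_src = xs - focal_px * baseline_m * inv_depth_left   # omega(x, rho(x))
    valid = (x_src >= 0) & (x_src <= w - 1)               # warps inside the image
    x0 = np.clip(np.floor(x_src).astype(np.int64), 0, w - 2)
    alpha = np.clip(x_src - x0, 0.0, 1.0)
    rows = np.repeat(np.arange(h)[:, None], w, axis=1)
    warped = (1.0 - alpha) * right[rows, x0] + alpha * right[rows, x0 + 1]
    return warped, valid

# Photoconsistency residual of Eq. (2), evaluated only on valid pixels:
# warped, valid = warp_right_to_left(right, rho_left, f, b)
# residual = np.abs(left - warped)[valid]
```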

We additionally make use of the interchangeability of the stereo images. We quantify the supervised loss in both images by projecting the ground truth laser data into each of the stereo images. We also constrain the depth estimate between the left and right stereo images to be consistent implicitly by enforcing photoconsistency based on the inverse depth prediction for both images, i.e.,

$I_{\mathrm{left}}(x) \overset{!}{=} I_{\mathrm{right}}(\omega(x, \rho(x)))$
$I_{\mathrm{right}}(x) \overset{!}{=} I_{\mathrm{left}}(\omega(x, -\rho(x)))$.  (4)

Finally, in textureless regions without ground truth depth readings, the depth map prediction problem is ill-posed and an adequate regularization needs to be imposed.

3.1. Loss function

We formulate a single loss function that incorporates both types of constraints that arise from supervised and unsupervised cues seamlessly,

$\mathcal{L}_\theta(I_l, I_r, Z_l, Z_r) = \lambda_t \mathcal{L}^S_\theta(I_l, I_r, Z_l, Z_r) + \gamma\, \mathcal{L}^U_\theta(I_l, I_r) + \mathcal{L}^R_\theta(I_l, I_r)$,  (5)

where λ_t and γ are trade-off parameters between the supervised loss L^S_θ, the unsupervised loss L^U_θ, and a regularization term L^R_θ. With θ we denote the CNN network parameters that generate the inverse depth maps ρ_{l/r,θ}.
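
As a minimal illustration of Eq. (5), the three terms are simply combined with the two trade-off weights (the function name is ours):

```python
def total_loss(loss_supervised, loss_unsupervised, loss_regularization,
               lambda_t, gamma):
    """Semi-supervised objective of Eq. (5): weighted sum of the supervised,
    unsupervised, and regularization terms."""
    return lambda_t * loss_supervised + gamma * loss_unsupervised + loss_regularization
```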

Supervised loss. The supervised loss term measures the deviation of the predicted depth map from the available ground truth at the pixels,

$\mathcal{L}^S_\theta = \sum_{x \in \Omega_{Z,l}} \left\| \rho_{l,\theta}(x)^{-1} - Z_l(x) \right\|_\delta + \sum_{x \in \Omega_{Z,r}} \left\| \rho_{r,\theta}(x)^{-1} - Z_r(x) \right\|_\delta$.  (6)

We use the berHu norm ‖·‖_δ as introduced in [16] to focus CNN training on larger depth residuals,

$\|d\|_\delta = \begin{cases} |d| & \text{if } d \le \delta \\ \frac{d^2 + \delta^2}{2\delta} & \text{if } d > \delta \end{cases}$  (7)

We adaptively set

$\delta = 0.2 \max_{x \in \Omega_Z} \left( \left| \rho(x)^{-1} - Z(x) \right| \right)$.  (8)
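
A minimal NumPy sketch of the supervised term for a single image follows, assuming a dense map of predicted inverse depths and a boolean mask marking the sparse pixels with projected LiDAR depth; the names are ours. The full loss of Eq. (6) sums this term over both images of the stereo pair.

```python
import numpy as np

def berhu(residual, delta):
    """berHu norm of Eq. (7): L1 below delta, scaled quadratic above."""
    abs_res = np.abs(residual)
    return np.where(abs_res <= delta, abs_res, (abs_res**2 + delta**2) / (2.0 * delta))

def supervised_loss_single_image(inv_depth_pred, lidar_depth, lidar_mask):
    """Compare predicted depth rho(x)^-1 to the projected LiDAR depth Z(x)
    on the sparse pixel set Omega_Z (Eq. 6), with the adaptive delta of Eq. (8)."""
    depth_pred = 1.0 / np.clip(inv_depth_pred[lidar_mask], 1e-6, None)
    residual = depth_pred - lidar_depth[lidar_mask]
    delta = 0.2 * np.max(np.abs(residual))   # adaptive threshold, Eq. (8)
    return berhu(residual, delta).sum()
```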

Unsupervised loss. The unsupervised part of our loss quantifies the direct image alignment error in both directions,

$\mathcal{L}^U_\theta = \sum_{x \in \Omega_{U,l}} \left| (G_\sigma * I_l)(x) - (G_\sigma * I_r)(\omega(x, \rho_{l,\theta}(x))) \right| + \sum_{x \in \Omega_{U,r}} \left| (G_\sigma * I_r)(x) - (G_\sigma * I_l)(\omega(x, -\rho_{r,\theta}(x))) \right|$,  (9)

with a Gaussian smoothing kernel G_σ with a standard deviation of σ = 1 px. We found this small amount of Gaussian smoothing to be beneficial, presumably due to reducing image noise. We evaluate the direct image alignment loss at the sets of image pixels Ω_{U,l/r} of the reconstructed images that warp to a valid location in the second image.
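
The sketch below implements one direction of Eq. (9) with SciPy, assuming rectified single-channel images and a dense inverse depth map; the symmetric term for the right image uses −ρ in the warp. Function and parameter names are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def alignment_loss_left(left, right, inv_depth_left, focal_px, baseline_m, sigma=1.0):
    """Direct image alignment error for the left image (first sum of Eq. 9):
    blur both images with a Gaussian of sigma = 1 px, warp the right image into
    the left view via omega(x, rho(x)), and sum absolute differences over pixels
    that warp to a valid location (Omega_U,l)."""
    left_s = gaussian_filter(left, sigma)
    right_s = gaussian_filter(right, sigma)
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    x_src = xs - focal_px * baseline_m * inv_depth_left
    valid = (x_src >= 0) & (x_src <= w - 1)
    warped = map_coordinates(right_s, [ys, x_src], order=1, mode="nearest")
    return np.abs(left_s - warped)[valid].sum()
```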

Regularization loss. As suggested in [9], the smoothness term penalizes depth changes at pixels with low intensity variation. In order to allow for depth discontinuities at object contours, we downscale the regularization term anisotropically according to the intensity variation:

$\mathcal{L}^R_\theta = \sum_{i \in \{l,r\}} \sum_{x \in \Omega} \left| \phi(\nabla I_i(x))^\top \nabla \rho_i(x) \right|$  (10)

with $\phi(g) = \left( \exp(-\eta |g_x|), \exp(-\eta |g_y|) \right)^\top$ and $\eta = \frac{1}{255}$.
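
A possible NumPy sketch of the regularization term for one image (Eq. 10); finite differences approximate the gradients, and the full term sums this over both images of the stereo pair. Names are ours.

```python
import numpy as np

def smoothness_loss(image, inv_depth, eta=1.0 / 255.0):
    """Edge-aware smoothness of Eq. (10): |phi(grad I)^T grad rho| summed over
    pixels, where phi damps the penalty where the intensity gradient is large."""
    gx_img = np.diff(image, axis=1)[:-1, :]       # crop gradients to a common grid
    gy_img = np.diff(image, axis=0)[:, :-1]
    gx_rho = np.diff(inv_depth, axis=1)[:-1, :]
    gy_rho = np.diff(inv_depth, axis=0)[:, :-1]
    wx = np.exp(-eta * np.abs(gx_img))            # phi component for x
    wy = np.exp(-eta * np.abs(gy_img))            # phi component for y
    return np.abs(wx * gx_rho + wy * gy_rho).sum()
```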

Supervised, unsupervised, and regularization terms are seamlessly combined within our novel semi-supervised loss function formulation (see Fig. 2). In contrast to previous methods, our approach treats both cameras in the stereo setup equivalently. All three loss components are formulated in a symmetric way for the cameras, which implicitly enforces consistency in the predicted depth maps between the cameras.

Layer                          | Channels I/O | Scaling | Inputs
conv1 (7x7, stride 2)          | 3 / 64       | 2       | RGB
max_pool1 (3x3, stride 2)      | 64 / 64      | 4       | conv1
res_block1 (type 2, stride 1)  | 64 / 256     | 4       | max_pool1
res_block2 (type 1, stride 1)  | 256 / 256    | 4       | res_block1
res_block3 (type 1, stride 1)  | 256 / 256    | 4       | res_block2
res_block4 (type 2, stride 2)  | 256 / 512    | 8       | res_block3
res_block5 (type 1, stride 1)  | 512 / 512    | 8       | res_block4
res_block6 (type 1, stride 1)  | 512 / 512    | 8       | res_block5
res_block7 (type 1, stride 1)  | 512 / 512    | 8       | res_block6
res_block8 (type 2, stride 2)  | 512 / 1024   | 16      | res_block7
res_block9 (type 1, stride 1)  | 1024 / 1024  | 16      | res_block8
res_block10 (type 1, stride 1) | 1024 / 1024  | 16      | res_block9
res_block11 (type 1, stride 1) | 1024 / 1024  | 16      | res_block10
res_block12 (type 1, stride 1) | 1024 / 1024  | 16      | res_block11
res_block13 (type 1, stride 1) | 1024 / 1024  | 16      | res_block12
res_block14 (type 2, stride 2) | 1024 / 2048  | 32      | res_block13
res_block15 (type 1, stride 1) | 2048 / 2048  | 32      | res_block14
res_block16 (type 1, stride 1) | 2048 / 2048  | 32      | res_block15
conv2 (1x1, stride 1)          | 2048 / 1024  | 32      | res_block16
upproject1                     | 1024 / 512   | 16      | conv2
upproject2                     | 512 / 256    | 8       | upproject1, res_block13
upproject3                     | 256 / 128    | 4       | upproject2, res_block7
upproject4                     | 128 / 64     | 2       | upproject3, res_block3
conv3 (3x3, stride 1)          | 64 / 1       | 2       | upproject4

Table 1. Layers in our deep residual encoder-decoder architecture. We input the final output layers at each resolution of the encoder at the respective decoder layers (long skip connections). This facilitates the prediction of fine detailed depth maps by the CNN.

3.2. Network Architecture

We use a deep residual network architecture in an encoder-decoder scheme, similar to the supervised approach in [16] (see Fig. 3). Taking inspiration from non-residual architectures such as FlowNet [3], our architecture includes long skip connections between the encoder and decoder to facilitate fine detail predictions at the output resolution. Table 1 details the various layers in our network.

Input to our network is the RGB camera image. The encoder resembles a ResNet-50 [11] architecture (without the final fully connected layer) and successively extracts low-resolution, high-dimensional features from the input image. The encoder subsamples the input image in 5 stages, the first stage convolving the image to half the input resolution and each successive stage stacking multiple residual blocks. The decoder upprojects the output of the encoder using residual blocks. We found that adding long skip connections between corresponding layers in encoder and decoder to this architecture slightly improves the performance on all metrics without affecting convergence. Moreover, the network is able to predict more detailed depth maps than without skip connections.

Figure 3. Illustration of our deep residual encoder-decoder architecture (c1, c3, mp1 abbreviate conv1, conv3, and max_pool1, respectively). Skip connections from corresponding encoder layers to the decoder facilitate fine detailed depth map prediction.

Figure 4. Type 1 residual block res_block^1_s with stride s = 1. The residual is obtained from 3 successive convolutions. The residual has the same number of channels as the input.

We denote a convolution of filter size k × k and stride s by conv^k_s. The same notation applies to pooling layers, e.g., max_pool^k_s. Each convolution layer is followed by batch normalization, with the exception of the last layer in the network. Furthermore, we use ReLU activation functions on the output of the convolutions, except at the inputs to the sum operation of the residual blocks, where the ReLU comes after the sum operation. res_block^i_s denotes the residual block of type i with stride s at its first convolution layer; see Figs. 4 and 5 for details on each type of residual block. Smaller feature blocks consist of 16·s maps, while larger blocks contain 4 times more feature maps, where s is the output scale of the residual block. Lastly, upproject is the upprojection layer proposed by Laina et al. [16]. We use the fast implementation of upprojection layers, but for better illustration we visualize upprojection by its "naive" version (see Fig. 6).
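
The paper does not state an implementation framework; as a hypothetical PyTorch sketch consistent with the description above and Fig. 5, a type 2 residual block could look as follows (class and argument names are ours). Per Table 1, res_block4 would for instance be ResBlockType2(256, 128, 512, stride=2), with 16·8 = 128 bottleneck channels at output scale 8.

```python
import torch
import torch.nn as nn

class ResBlockType2(nn.Module):
    """Type 2 residual block (cf. Fig. 5): the residual branch is three
    convolutions (the first with stride s), each followed by batch
    normalization; a strided 1x1 convolution projects the input to the
    residual's channel count; ReLU is applied after the sum."""

    def __init__(self, in_ch, mid_ch, out_ch, stride):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.project = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return torch.relu(self.residual(x) + self.project(x))
```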

Figure 5. Type 2 residual block res_block^2_s with stride s. The residual is obtained from 3 successive convolutions, while the first convolution applies stride s. An additional convolution applies the same stride s and projects the input to the number of channels of the residual.

Figure 6. Schematic illustration of the upprojection residual block. It unpools the input by a factor of 2 and applies a residual block which reduces the number of channels by a factor of 2.

4. Experiments

We evaluate our approach on the raw sequences of the KITTI benchmark [7], which is a popular dataset for single image depth map prediction. The sequences contain stereo imagery taken from a driving car in an urban scenario. The dataset also provides 3D laser measurements from a Velodyne laser scanner that we use as ground-truth measurements. This dataset has been used to train and evaluate the state-of-the-art methods and allows for quantitative comparison.

We use the split into 28 training and 28 testing scenes as proposed by Eigen et al. [5]. For training, we even out the sequence distribution by using 450 frames per sequence. This results in 7346 unique frames and 12600 frames in total for training. Since there is no official validation set, we create one by sampling every tenth frame from the remaining 5 sequences with little image motion. All these sequences are urban, so we additionally select those frames from the training sequences that lie in the middle between 2 training images with a distance of at least 20 frames. In total we obtain a validation set of 100 urban and 144 residential area images.

4.1. Implementation Details

We initialize the encoder part of our network with ResNet-50 [11] weights pretrained for the ImageNet classification task. The convolution filter weights in the decoder part are initialized randomly according to the approach of Glorot and Bengio [8]. We also tried the initialization by He et al. [10] but did not notice any performance difference. We predict the inverse depth and initialize the network in such a way that the predicted values are close to 0 in the beginning of training. This way, the unsupervised direct image alignment loss is initialized with almost zero disparity between the images. However, this also results in large gradients from the supervised loss which would cause divergence of the model. To achieve a convergent optimization, we slowly fade in the supervised loss with the number of iterations t, using λ_t = β exp(−10/t). We also experimented with gradually fading in the unsupervised loss, but experienced degraded performance on the upper part of the image. In order to avoid overfitting we use L2 regularization on all the model weights with weight decay w_d = 0.00004. We also apply dropout to the output of the last upprojection layer with a dropout probability of 0.5.

To train the CNN on KITTI we use stochastic gradient descent with momentum, with a learning rate of 0.01 and a momentum of 0.9. We train the variants of our model for at least 15 epochs on an NVIDIA GTX 980Ti with 6 GB memory, which allows for a batch size of 5. We stop training when the validation loss starts to increase and select the model with the best RMSE score on the validation set. The network is trained on a resolution of 621×187 pixels for both input images and ground truth depth maps. Hence, the resolution of the predicted inverse depth maps is 320×96. For evaluation we upsample the predicted depth maps to the resolution of the ground truth. For data augmentation, we use γ-augmentation and also randomly multiply the intensities of the input images by a value α ∈ [0.8, 1.2]. We use linear interpolation for subpixel-level warping. The inference from one image takes 0.048 s on average.
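
A small sketch of the fade-in schedule of λ_t, assuming the exponent −10/t, which yields a weight that starts near zero and approaches β with increasing iteration count t:

```python
import math

def supervised_weight(t, beta):
    """Fade-in of the supervised loss weight: lambda_t = beta * exp(-10 / t).
    Early iterations give a weight close to zero, avoiding large supervised
    gradients on the near-zero initial disparity; the weight approaches beta
    as t grows."""
    return beta * math.exp(-10.0 / max(t, 1))
```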

4.2. Evaluation Metrics

We evaluate the accuracy of our method in depth prediction using the 3D laser ground truth on the test images. We use the following depth evaluation metrics used by Eigen et al. [5]:

RMSE: $\sqrt{\tfrac{1}{T}\sum_{i=1}^{T} \left\| \rho(x_i)^{-1} - Z(x_i) \right\|_2^2}$,

RMSE (log): $\sqrt{\tfrac{1}{T}\sum_{i=1}^{T} \left\| \log(\rho(x_i)^{-1}) - \log(Z(x_i)) \right\|_2^2}$,

Accuracy: $\frac{\left| \left\{\, i \in \{1,\dots,T\} \;\middle|\; \max\!\left( \frac{\rho(x_i)^{-1}}{Z(x_i)}, \frac{Z(x_i)}{\rho(x_i)^{-1}} \right) = \delta < \mathrm{thr} \right\} \right|}{T}$,

ARD: $\tfrac{1}{T}\sum_{i=1}^{T} \frac{\left| \rho(x_i)^{-1} - Z(x_i) \right|}{Z(x_i)}$,

SRD: $\tfrac{1}{T}\sum_{i=1}^{T} \frac{\left| \rho(x_i)^{-1} - Z(x_i) \right|^2}{Z(x_i)}$,

where T is the number of pixels with ground-truth in the test set.
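
For reference, a NumPy sketch computing these metrics from flattened arrays of predicted depths (ρ(x_i)^−1) and ground-truth depths Z(x_i); the names are ours.

```python
import numpy as np

def depth_metrics(pred_depth, gt_depth, thresholds=(1.25, 1.25**2, 1.25**3)):
    """KITTI-style depth metrics over the T pixels with ground truth."""
    err = pred_depth - gt_depth
    rmse = np.sqrt(np.mean(err ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred_depth) - np.log(gt_depth)) ** 2))
    ard = np.mean(np.abs(err) / gt_depth)        # absolute relative difference
    srd = np.mean(err ** 2 / gt_depth)           # squared relative difference
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    accuracy = [np.mean(ratio < thr) for thr in thresholds]
    return rmse, rmse_log, ard, srd, accuracy
```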

In order to compare our results with Eigen et al. [5] and Godard et al. [9], we crop our image to the evaluation crop applied by Eigen et al. (crop A). We also use the same resolution of the ground truth depth image and cap the predicted depths at 80 m [9]. For comparison with Garg et al. [6], we also provide results when capping the predicted depths at 50 m maximum depth and using their 608×176 crop (crop B). For completeness, we also give results for our method evaluated on the uncropped image without a cap on the maximum predicted depth. For all cases the minimum ground-truth depth is 5 m.

4.3. Results

Table 2 shows our results in relation to the state-of-the-art methods on the test images of the KITTI benchmark. For most metrics and setups (all except δ < 1.25 with the 50 m cap), our system clearly performs best. We outperform the best setup of Godard et al. [9] (unpublished) by more than 17% in terms of RMSE and by almost 22% in terms of its log scale on crop A with a cap of 80 m. When evaluating at a prediction cap of 50 m and crop B, our predictions are still on average 19 cm more accurate. However, since ground truth points with depth greater than 50 m are present in the dataset and our method is able to provide high-quality predictions for the corresponding pixels, capping the predicted depth makes our approach perform worse. In particular, RMSE decreases by more than 12 cm for uncapped evaluation at the same crop.

We also qualitatively compare the output of our method with the state-of-the-art in Fig. 7. In some parts, the predictions of Godard et al. [9] may appear more detailed and our depth maps seem to be smoother. However, these details are not always consistent with the ground truth depth maps, as also indicated by the quantitative results. For instance, our predictions for the thin traffic poles and lights of the top frame in Figure 7 are more accurate.

We also analyze the contributions of the various design choices in our approach (see Table 3). The computation of the unsupervised loss term on all valid pixels improves the performance compared to the variant with the unsupervised term evaluated only for valid pixels without available ground truth. When using the L2-norm on the supervised loss instead of the berHu norm, the RMSE evaluation metric on the ground-truth depth improves on the validation set, but is worse on the test set. The L2-norm also visually produces noisier depth maps. Thus, we prefer to use berHu over L2, which reduces the noise (see Fig. 8) and performs better on the test set.

Approach                             | Crop | Cap  | RMSE  | RMSE (log) | ARD   | SRD   | δ<1.25 | δ<1.25² | δ<1.25³
-------------------------------------|------|------|-------|------------|-------|-------|--------|---------|--------
Eigen et al. [5] coarse 28×144       | A    | 80 m | 7.216 | 0.273      | 0.228 | -     | 0.679  | 0.897   | 0.967
Eigen et al. [5] fine 27×142         | A    | 80 m | 7.156 | 0.270      | 0.215 | -     | 0.692  | 0.899   | 0.967
Liu et al. [20] DCNF-FCSP FT         | A    | 80 m | 6.986 | 0.289      | 0.217 | 1.841 | 0.647  | 0.882   | 0.961
Godard et al. [9] (unpublished)      | A    | 80 m | 5.849 | 0.242      | 0.141 | 1.369 | 0.818  | 0.929   | 0.966
Godard et al. [9] + CS (unpublished) | A    | 80 m | 5.763 | 0.236      | 0.136 | 1.512 | 0.836  | 0.935   | 0.968
Ours, supervised only                | A    | 80 m | 5.157 | 0.202      | 0.129 | 0.864 | 0.830  | 0.954   | 0.986
Ours, unsupervised only              | A    | 80 m | 7.323 | 0.288      | 0.202 | 3.242 | 0.766  | 0.919   | 0.964
Ours                                 | A    | 80 m | 4.903 | 0.194      | 0.123 | 0.859 | 0.848  | 0.958   | 0.986
Garg et al. [6] L12 Aug 8            | B    | 50 m | 5.104 | 0.273      | 0.169 | 1.080 | 0.740  | 0.904   | 0.962
Godard et al. [9] (unpublished)      | B    | 50 m | 5.061 | 0.221      | 0.123 | 0.944 | 0.843  | 0.942   | 0.972
Godard et al. [9] + CS (unpublished) | B    | 50 m | 4.941 | 0.215      | 0.118 | 0.932 | 0.858  | 0.947   | 0.974
Ours, supervised only                | B    | 50 m | 4.958 | 0.196      | 0.123 | 0.795 | 0.839  | 0.956   | 0.986
Ours, unsupervised only              | B    | 50 m | 6.774 | 0.321      | 0.235 | 3.343 | 0.750  | 0.907   | 0.955
Ours                                 | B    | 50 m | 4.751 | 0.188      | 0.117 | 0.781 | 0.856  | 0.961   | 0.987
Ours                                 | B    | ∞    | 4.630 | 0.188      | 0.117 | 0.778 | 0.859  | 0.961   | 0.987

(RMSE, RMSE (log), ARD, SRD: lower is better; δ thresholds: higher is better.)

Table 2. Quantitative results of our method and approaches reported in the literature on the test set of the KITTI raw dataset used by Eigen et al. [5] for different depth prediction caps.

Approach                                            | RMSE  | RMSE (log) | δ<1.25 | δ<1.25² | δ<1.25³
----------------------------------------------------|-------|------------|--------|---------|--------
Our full approach                                   | 4.627 | 0.189      | 0.856  | 0.960   | 0.986
Our full approach*                                  | 4.679 | 0.192      | 0.854  | 0.959   | 0.985
L2-norm instead of berHu-norm in supervised loss    | 4.659 | 0.195      | 0.841  | 0.958   | 0.986
No long skip connections*                           | 4.762 | 0.194      | 0.853  | 0.958   | 0.985
No Gaussian smoothing in unsupervised loss*         | 4.752 | 0.193      | 0.854  | 0.958   | 0.986
No long skip connections and no Gaussian smoothing* | 4.798 | 0.195      | 0.853  | 0.957   | 0.984
Supervised training only                            | 4.862 | 0.197      | 0.839  | 0.956   | 0.986
Unsupervised training only (50 m cap)               | 6.930 | 0.330      | 0.745  | 0.903   | 0.952
Only 50 % of laser points used*                     | 4.808 | 0.192      | 0.852  | 0.958   | 0.986
Only 1 % of laser points used*                      | 4.892 | 0.202      | 0.843  | 0.952   | 0.983

(RMSE, RMSE (log): lower is better; δ thresholds: higher is better.)

Table 3. Quantitative results of different variants of our approach on the KITTI raw dataset used by Eigen et al. [5] (without cropping and capping the predicted depth). Approaches marked with * evaluate the unsupervised loss only for the pixels without available ground truth.

We also found that our system benefits from both long skip connections and Gaussian smoothing in the unsupervised loss. The latter also results in slightly faster convergence.

To show that our approach benefits from the semi-supervised pipeline, we also give results for purely supervised and purely unsupervised training. For purely supervised learning, our network achieves less accurate depth map prediction than in the semi-supervised setting. In the unsupervised case, the depth maps include larger amounts of outliers, such that we provide results for capped depth predictions at a maximum of 50 m. Here, our network seems to perform less well than the unsupervised methods of Godard et al. [9] and Garg et al. [6]. Notably, our approach does not perform multi-scale image alignment, but uses the available ground truth to avoid local optima of the direct image alignment. We also demonstrate that our system does not suffer severely if the ground truth depth is reduced to 50% or 1% of the available measurements. To this end, we subsample the available laser data prior to projecting it into the camera image.

Our results clearly demonstrate the benefit of using a deep residual encoder-decoder architecture with long skip connections for the task of single image depth map prediction. Our semi-supervised approach gives additional training cues to the supervised loss through direct image alignment. This combination is even capable of improving the depth prediction error on the laser ground-truth compared to purely supervised learning. Our semi-supervised learning method converges much faster (in about one third the number of iterations) than purely supervised training.

We also demonstrate the generalization ability of our model trained on KITTI to other datasets. Fig. 9 gives qualitative results of our model on test images of Make3D [23, 25] and Cityscapes [2]. We also evaluated our model quantitatively on Make3D, where it results in 8.005 RMSE (m), 0.185 Log10 error (see [16]) and 0.418 ARD. Qualitatively, our model captures the general scene layout and objects such as cars, trees and pedestrians well in images that share similarities with the KITTI dataset.

Figure 7. Qualitative results and comparison with state-of-the-art methods (columns: RGB, GT, [5], [20], [6], [9], ours; baseline results from [9]). Ground-truth (GT) has been interpolated for visualization. Note the crisper prediction of our method on objects such as cars, pedestrians and traffic signs. Also notice how our method can learn appropriate depth predictions in the upper part of the image that is not covered by the ground-truth.

Figure 8. Qualitative results of variants of our semi-supervised learning approach. Shown variants are the full model (full), without long skip connections (no LS), with L2 norm on the supervised loss (L2) and using half the ground-truth laser measurements (half GT).

Figure 9. Qualitative results on Make3D (left 2) and Cityscapes (right).

5. Conclusions

In this paper, we proposed a novel semi-supervised deep learning approach to monocular depth map prediction. Purely supervised learning requires a vast amount of data. In outdoor environments, often supplementary sensors such as 3D lasers have to be used to acquire training data. These sensors come with their own shortcomings such as specific error and noise characteristics and sparsity of the measurements. We complement such supervised cues with unsupervised learning based on direct image alignment between the images in a stereo camera setup. We quantify the photoconsistency of pixels in both images that correspond to each other according to the depth predicted by the CNN.

We use a state-of-the-art deep residual network in an encoder-decoder architecture and enhance it with long skip connections. Our main contribution is a seamless combination of supervised, unsupervised, and regularization terms in our semi-supervised loss function. The loss terms are defined symmetrically for the available cameras in the stereo setup, which implicitly promotes consistency in the depth estimates. Our approach achieves state-of-the-art performance in single image depth map prediction on the popular KITTI dataset. It is able to predict detailed depth maps on thin and distant objects. It also estimates reasonable depth in image parts in which there is no ground-truth available for supervised learning.

In future work, we will investigate semi-supervised learning for further tasks such as semantic image segmentation. Our approach could also be extended to couple monocular and stereo depth cues in a unified deep learning framework.

References

[1] W. Chen, Z. Fu, D. Yang, and J. Deng. Single-image depth perception in the wild. In NIPS, 2016.

[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[3] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.

[4] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015.

[5] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014.

[6] R. Garg, V. Kumar, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.

[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

[8] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[9] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. arXiv:1609.03677v1, 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[12] K. Karsch, C. Liu, and S. B. Kang. Depth extraction from video using non-parametric sampling. In ECCV, pages 775–788, 2012.

[13] J. Konrad, M. Wang, and P. Ishwar. 2D-to-3D image conversion by learning depth from examples. In CVPR Workshops, 2012.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1097–1105, 2012.

[15] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In CVPR, pages 89–96, 2014.

[16] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.

[17] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In CVPR, pages 1119–1127, 2015.

[18] C. Li, A. Kowdle, A. Saxena, and T. Chen. Toward holistic scene understanding: Feedback enabled cascaded classification models. PAMI, 34(7):1394–1408, July 2012.

[19] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In CVPR, 2010.

[20] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In CVPR, pages 5162–5170, 2015.

[21] M. Liu, M. Salzmann, and X. He. Discrete-continuous depth estimation from a single image. In CVPR, pages 716–723, 2014.

[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.

[23] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS, 2005.

[24] A. Saxena, S. H. Chung, and A. Y. Ng. 3D depth reconstruction from a single still image. IJCV, 76(1):53–69, Jan. 2008.

[25] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. PAMI, 2009.

[26] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille. Towards unified depth and semantic prediction from a single image. In CVPR, pages 2800–2809, 2015.

[27] J. Xie, R. Girshick, and A. Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.
