

Dense Depth Estimation in Monocular Endoscopy with Self-supervised Learning Methods

Xingtong Liu, Ayushi Sinha, Masaru Ishii, Gregory D. Hager, Fellow, IEEE, Austin Reiter, Russell H. Taylor, Fellow, IEEE, and Mathias Unberath

Abstract—We present a self-supervised approach to training convolutional neural networks for dense depth estimation from monocular endoscopy data without a priori modeling of anatomy or shading. Our method only requires monocular endoscopic videos and a multi-view stereo method, e.g., structure from motion, to supervise learning in a sparse manner. Consequently, our method requires neither manual labeling nor patient computed tomography (CT) scans in the training and application phases. In a cross-patient experiment using CT scans as ground truth, the proposed method achieved submillimeter mean residual error. In a comparison study against recent self-supervised depth estimation methods designed for natural video on in vivo sinus endoscopy data, we demonstrate that the proposed approach outperforms the previous methods by a large margin. The source code for this work is publicly available online at https://github.com/lppllppl920/EndoscopyDepthEstimation-Pytorch.

Index Terms—Endoscopy, unsupervised learning, self-supervised learning, depth estimation

I. INTRODUCTION

MINIMALLY invasive procedures in the head and neck, e.g., functional endoscopic sinus surgery, typically employ surgical navigation systems to provide surgeons with additional anatomical and positional information. This helps them avoid critical structures, such as the brain, eyes, and major arteries, that are spatially close to the sinus cavities and must not be disturbed during surgery. Computer vision-based navigation systems that rely on the intra-operative endoscopic video stream and do not introduce additional hardware are both easy to integrate into clinical workflow and cost-effective.

Manuscript received February 20, 2019; revised September 21, 2019; accepted October 25, 2019. This work was funded in part by NIH R01-EB015530, in part by a research contract from Galen Robotics, in part by a fellowship grant from Intuitive Surgical, and in part by Johns Hopkins University internal funds. (Corresponding author: Xingtong Liu.)

Xingtong Liu is with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA (e-mail: [email protected]).

Ayushi Sinha was with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA. She is now with Philips Research, Cambridge, MA 02141 USA (e-mail: [email protected]).

Masaru Ishii is with Johns Hopkins Medical Institutions, Baltimore, MD 21224 USA (e-mail: [email protected]).

Gregory D. Hager is with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA (e-mail: [email protected]).

Austin Reiter was with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA. He is now with Facebook, New York, NY 10003 USA (e-mail: [email protected]).

Russell H. Taylor is with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA. He is a paid consultant to and owns equity in Galen Robotics, Inc. These arrangements have been reviewed and approved by JHU in accordance with its conflict of interest policy (e-mail: [email protected]).

Mathias Unberath is with the Computer Science Department, Johns Hopkins University, Baltimore, MD 21287 USA (e-mail: [email protected]).

Such systems generally require registration of pre-operative data, such as CT scans or statistical models, to the intra-operative video data [1]–[4]. This registration must be highly accurate to guarantee the reliable performance of the navigation system. To enable an accurate registration, a feature-based video-CT registration algorithm requires accurate and sufficiently dense intra-operative 3D reconstructions of the anatomy from endoscopic videos. Obtaining such reconstructions is not trivial due to problems such as specular reflectance, lack of photometric constancy across frames, tissue deformation, and so on.

A. Contributions

In this paper, we build upon our prior work [5] and present a self-supervised learning approach for single-frame dense depth estimation in monocular endoscopy. Our contributions are as follows: (1) To the best of our knowledge, this is the first deep learning-based dense depth estimation method that only requires monocular endoscopic images during both training and application phases. In particular, it requires neither manual data labeling nor scaling, nor any other imaging modalities such as CT. (2) We propose several novel network loss functions and layers that exploit information from traditional multi-view stereo methods and enforce geometric relationships between video frames without the requirement of photometric constancy. (3) We demonstrate that our method generalizes well across different patients and endoscope cameras.

B. Related work

Several methods have been explored for depth estimation in endoscopy. These can be grouped into traditional multi-view stereo algorithms and fully supervised learning-based methods.

Multi-view stereo methods, such as Structure from Motion (SfM) [1] and Simultaneous Localization and Mapping (SLAM) [6], are able to simultaneously reconstruct 3D structure while estimating camera poses in feature-rich scenes. However, the paucity of features in endoscopic images of anatomy can cause these methods to produce sparse and unevenly distributed reconstructions. This shortcoming, in turn, can lead to inaccurate registrations. Mahmoud et al. propose a quasi-dense SLAM-based method that explores local information around sparse reconstructions from a state-of-the-art SLAM system [7]. This method densifies the sparse reconstructions from a classic SLAM system and is also reasonably accurate. However, this approach is potentially sensitive to hyper-parameters because of the normalized cross-correlation-based matching of image patches.

Convolutional neural networks (CNN) have shown promising results in high-complexity problems including general scene depth estimation [8], which benefits from local and global context information and multi-level representations. However, using CNNs in a fully supervised fashion on endoscopic videos is challenging because dense ground truth depth maps that correspond directly to real endoscopic images are hard to obtain. Several simulation-based works try to solve this challenge by training on synthetic dense depth maps generated from patient-specific CT data. Visentini-Scarzanella et al. use untextured endoscopy video simulations from CT data to train a fully supervised depth estimation network and rely on another transcoder network to convert real video frames to the texture-independent ones required for depth prediction [9]. This method requires per-endoscope photometric calibration and complex registration designed for narrow tube-like structures. In addition, it remains unclear whether this method will work on in vivo images since validation is limited to two lung nodule phantoms. Mahmood et al. simulate pairs of color images and dense depth maps from CT data for depth estimation network training. During the application phase, they use a Generative Adversarial Network to convert real endoscopic images to simulation-like ones and then feed them to the trained depth estimation network [10]. In their work, the appearance transformer network is trained separately by simply mimicking the appearance of simulated images, without knowledge of the target task, i.e., depth estimation, which can degrade performance or even yield incorrect depth estimates. Besides simulation-based methods, hardware-based solutions exist that may be advantageous in the sense that they usually do not rely on pre-operative imaging modalities [11], [12]. However, incorporating depth or stereo cameras into endoscopes is challenging and, even if possible, these cameras may still fail to acquire sufficiently dense and accurate depth maps of endoscopic scenes for fully supervised training because of the non-Lambertian reflectance properties of tissues and the paucity of features.

Several self-supervised approaches for single-frame depth estimation have been proposed in the generic field of computer vision [13]–[16]. However, based on our observations and experiments, these methods are not generally applicable to endoscopy for several reasons. First, the photometric constancy between frames assumed in their work is not available in endoscopy. The camera and light source move jointly, and therefore, the appearance of the same anatomy can vary substantially with different camera poses, especially for regions close to the camera. Second, appearance-based warping loss suffers from gradient locality, as observed in [15]. This can cause network training to get trapped in bad local minima, especially for textureless regions. Compared to natural images, the overall scarcer and more homogeneous texture of tissues observed in endoscopy, e.g., sinus endoscopy and colonoscopy, makes it even more difficult for the network to obtain reliable information from photometric appearance. Moreover, estimating a global scale from monocular images is inherently ambiguous [17]. In natural images, the scale can be estimated using learned prior knowledge about the sizes of common objects, but there are no such visual cues in endoscopy, especially for images where instruments are not present. Therefore, approaches that try to jointly estimate depths and camera poses with correct global scales are unlikely to work in endoscopy.

The first and second points above demonstrate that the recent self-supervised approaches cannot enable the network to capture long-range correlation in either the spatial or temporal dimension in imaging modalities where no lighting constancy is available, e.g., endoscopy. On the other hand, traditional multi-view stereo methods, such as SfM, are capable of explicitly capturing long-range correspondences with illumination-invariant feature descriptors, e.g., Scale-Invariant Feature Transform (SIFT), and global optimization, e.g., bundle adjustment. We argue that the estimated sparse reconstructions and camera poses from SfM are valuable and should be integrated into the network training of monocular depth estimation. We propose novel network loss functions and layers that enable the integration of information from SfM and enforce the inherent geometric constraints between depth predictions of different viewpoints. Since this approach considers relative camera and scene geometry, it does not assume lighting constancy. This makes our overall design suitable for scenarios where lighting constancy cannot be guaranteed. Because of the inherent difficulty of global scale estimation for monocular camera-based methods, we elect to only estimate depth maps up to a global scale. This not only enables self-supervised learning from results of SfM, where true global scales cannot be estimated, but also makes the trained network generalizable across different patients and scope cameras, which is confirmed by our experiments. We introduce our method in terms of data preparation, network architecture, and loss design in Section II. The experimental setup and results are presented in Section III, where we show that our method works on unseen patients and cameras. Further, we show that our method outperforms two recent self-supervised depth estimation methods by a large margin on in vivo sinus endoscopy data. In Sections IV and V, we discuss the limitations of our work and future directions to explore.

II. METHODS

In this section, we describe methods to train convolutional neural networks for dense depth estimation in monocular endoscopy using sparse self-supervisory signals derived from SfM applied to video sequences. We explain how self-supervisory signals from monocular endoscopy videos are extracted, and introduce our novel network architecture and loss functions to enable network training based on these signals. The overall training architecture is shown in Fig. 1, where all concepts are introduced in this section. Overall, the network training depends on loss functions to backpropagate useful information in the form of gradients to update network parameters. The loss functions are the Sparse Flow Loss and the Depth Consistency Loss introduced in the Loss Functions section. To use these two losses to guide the training of depth estimation, several types of input data are needed. The input data are endoscopic video frames, camera poses and intrinsics, sparse depth maps, sparse soft masks, and sparse flow maps, which are introduced in the Training Data section. Finally, to convert network predictions obtained from the Monocular Depth Estimation module to proper forms for loss calculation, several custom layers are used. The custom layers are the Depth Scaling Layer, the Depth Warping Layer, and the Flow from Depth Layer, which are introduced in the Network Architecture section.

A. Training Data

Our training data are generated from unlabeled endoscopic videos. The generation pipeline is shown in Fig. 2. The pipeline is fully automated given endoscopic and calibration videos and could, in principle, be computed on-the-fly by replacing SfM with SLAM-based methods.

Data Preprocessing. A video sequence is first undistorted using distortion coefficients estimated from the corresponding calibration video. A sparse reconstruction, camera poses, and the point visibility are estimated by SfM [1] from the undistorted video sequence, where black invalid regions in the video frames are ignored. To remove extreme outliers in the sparse reconstruction, point cloud filtering is applied. The point visibility information, which appears as b below, is smoothed out by exploiting the continuous camera movement present in the video. The sparse-form data generated from the SfM results are introduced below.

Sparse Depth Map. The Monocular Depth Estimation module, shown in Fig. 1, only predicts depths up to a global scale. However, to enable valid loss calculation, the scale of the depth prediction and the SfM results must match. Therefore, the sparse depth map introduced here is used as an anchor to scale the depth prediction in the Depth Scaling Layer. To generate sparse depth maps, 3D points from the sparse reconstruction from SfM are projected onto image planes with camera poses, intrinsics, and point visibility information. The camera intrinsic matrix is K. The camera pose of frame j with respect to the world coordinate is T_{jw}, where w stands for the world coordinate system. The homogeneous coordinate of the nth 3D point of the sparse reconstruction in the world coordinate is p^w_n. Note that n can be the index of any point in the sparse reconstruction. Frame indices used in the following equations, e.g., j and k, can be any indices within the same video sequence. The difference of j and k is within a specified range to keep enough region overlap. The coordinate of the nth 3D point w.r.t. frame j, p^j_n, is

p^j_n = T_{jw} p^w_n . (1)

The depth of the nth 3D point w.r.t. frame j, z^j_n, is the z-axis component of p^j_n. The 2D projection location of the nth 3D point w.r.t. frame j, u^j_n, is

u^j_n = K p^j_n / z^j_n . (2)

We use b^j_n = 1 to indicate that the nth 3D point is visible to frame j and b^j_n = 0 to indicate otherwise. Note that the point visibility information from SfM is used to assign the value to b^j_n. The sparse depth map of frame j, Z^s_j, is

Z^s_j(u^j_n) = { z^j_n if b^j_n = 1; 0 if b^j_n = 0 }, (3)

where s stands for the word "sparse". Note that the equations in the Training Data section describe the value assignments for regions onto which points of the sparse reconstruction project. For regions onto which no points project, the values are set to zero.
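To make Eqs. (1)-(3) concrete, the following sketch projects SfM points into a sparse depth map for one frame. It is a minimal NumPy illustration under the notation above; the function and variable names (e.g., points_world, visibility_j) are ours and not those of the released code.

```python
import numpy as np

def sparse_depth_map(points_world, visibility_j, T_jw, K, height, width):
    """Project visible SfM points into a sparse depth map of frame j (Eqs. 1-3).

    points_world: (N, 3) 3D points of the sparse reconstruction in world coordinates.
    visibility_j: (N,) boolean flags b^j_n from SfM, True if point n is visible in frame j.
    T_jw:         (4, 4) camera pose of frame j w.r.t. the world (world -> camera).
    K:            (3, 3) camera intrinsic matrix.
    """
    Z_sparse = np.zeros((height, width), dtype=np.float32)

    # Eq. (1): homogeneous world coordinates -> coordinates w.r.t. frame j.
    p_w = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    p_j = (T_jw @ p_w.T).T[:, :3]

    for n in np.flatnonzero(visibility_j):
        z = p_j[n, 2]
        if z <= 0:
            continue
        # Eq. (2): perspective projection onto the image plane of frame j.
        u = K @ p_j[n] / z
        col, row = int(round(u[0])), int(round(u[1]))
        if 0 <= row < height and 0 <= col < width:
            Z_sparse[row, col] = z  # Eq. (3): store the z-axis distance at the projection.
    return Z_sparse
```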

Sparse Flow Map. The sparse flow map is used in the Sparse Flow Loss introduced below. Previously, we directly used the sparse depth map for loss calculation [5] to exploit self-supervisory signals of sparse reconstructions. This makes the training objective, i.e., the sparse depth map, for one frame fixed and potentially biased. Unlike the sparse depth map, the sparse flow map describes the 2D projected movement of the sparse reconstruction, which involves the camera poses of two input frames with a random frame interval. By combining the camera trajectory and sparse reconstruction, and considering all pair-wise frame combinations, the error distribution of the new objective, i.e., the sparse flow map, for one frame is more likely to be unbiased. This makes the network less affected by random noise in the training data. We observe that the depth predictions are naturally smooth and edge-preserving for the model trained with SFL, which removes the need for explicit regularization during training, e.g., the smoothness loss proposed in Zhou et al. [14] and Yin et al. [15].

The sparse flow map, F^s_{j,k}, represents the 2D projected movement of the sparse reconstruction from frame j to frame k:

F^s_{j,k}(u^j_n) = { (u^k_n - u^j_n) / (W, H)^T if b^j_n = 1; 0 if b^j_n = 0 }, (4)

where H and W are the height and width of the frame, respectively.

Sparse Soft Mask. A sparse mask enables the network to exploit the valid sparse signals in the sparse-form data and ignore the rest of the invalid regions. The soft weighting is defined before training; it accounts for the fact that the error distribution of individual points in the results of SfM differs and mitigates the effect of reconstruction errors from SfM. It is designed with the intuition that a larger number of frames used to triangulate a 3D point in the bundle adjustment of SfM usually means higher accuracy. The sparse soft mask is used in the SFL introduced below. The sparse soft mask of frame j, M_j, is defined as

M_j(u^j_n) = { 1 - e^{-Σ_i b^i_n / σ} if b^j_n = 1; 0 if b^j_n = 0 }, (5)

where i iterates over all frames in the video sequence to which SfM is applied, and σ is a hyper-parameter based on the average number of frames used to reconstruct each sparse point in SfM.
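Along the same lines, a hypothetical helper can assemble the sparse flow map of Eq. (4) and the sparse soft mask of Eq. (5) from the projected locations and visibility flags produced by SfM; again this is only an illustrative sketch, not the released data-generation pipeline.

```python
import numpy as np

def sparse_flow_and_mask(u_j, u_k, visible_j, track_lengths, height, width, sigma):
    """Build the sparse flow map F^s_{j,k} (Eq. 4) and sparse soft mask M_j (Eq. 5).

    u_j, u_k:      (N, 2) projected pixel locations (col, row) of the SfM points in
                   frames j and k, as computed by Eq. (2).
    visible_j:     (N,) boolean visibility flags b^j_n from SfM; following Eq. (4),
                   only visibility in frame j is checked.
    track_lengths: (N,) number of frames observing each point, i.e. sum_i b^i_n.
    sigma:         soft-mask hyper-parameter (the average track length in practice).
    """
    flow = np.zeros((height, width, 2), dtype=np.float32)
    mask = np.zeros((height, width), dtype=np.float32)

    for n in np.flatnonzero(visible_j):
        col, row = int(round(u_j[n, 0])), int(round(u_j[n, 1]))
        if not (0 <= row < height and 0 <= col < width):
            continue
        # Eq. (4): 2D displacement from frame j to frame k, normalized by (W, H).
        flow[row, col, 0] = (u_k[n, 0] - u_j[n, 0]) / width
        flow[row, col, 1] = (u_k[n, 1] - u_j[n, 1]) / height
        # Eq. (5): soft confidence grows with the number of observing frames.
        mask[row, col] = 1.0 - np.exp(-track_lengths[n] / sigma)
    return flow, mask
```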

B. Network Architecture

Our overall network architecture, shown in Fig. 1, consists of a two-branch Siamese network [19] in the training phase.


Fig. 1. Network architecture. Our network in the training phase (top) is a self-supervised two-branch Siamese network. Two frames j and k are randomly selected from the same video sequence as the input to the two-branch network. To ensure enough region overlap between the two frames, the frame interval is within a specified range. All concepts in the figure are introduced in Section II. The red dashed arrows indicate the data-loss correspondence. The warped depth map from k to j describes the scaled depth map of frame k viewed from the viewpoint of frame j. The dense flow map from j to k describes the 2D projected movement of the underlying 3D scene from frame j to k. During the application phase (bottom), we use the trained weights of the single-frame depth estimation architecture, which is a modified version of the architecture in [18], to predict a dense depth map that is accurate up to a global scale.

It relies on sparse signals from SfM and geometric constraints between two frames to learn to predict dense depth maps from single endoscopic video frames. In the application phase, the network has a simple single-branch architecture for depth estimation from a single frame. All the custom layers below are differentiable so that the network can be trained in an end-to-end manner.

Monocular Depth Estimation. This module uses a modified version of the 57-layer architecture in [18], known as DenseNet, which achieves performance comparable to other popular architectures with a large reduction in network parameters by extensively reusing preceding feature maps. We change the number of channels in the last convolutional layer to 1 and replace the final activation, which is log-softmax, with a linear activation to make the architecture suitable for the task of depth prediction. We also replace the transposed convolutional layers in the up-transition part of the network with nearest-neighbor upsampling and convolutional layers to reduce the checkerboard artifact of the final output [20].

Depth Scaling Layer. This layer matches the scale of the depth prediction from the Monocular Depth Estimation module and the corresponding SfM results for correct loss calculation. Note that all operations in the following equations are element-wise, except that Σ here is summation over all elements of a map. Z'_j is the depth prediction of frame j that is correct up to a scale. The scaled depth prediction of frame j, Z_j, is

Z_j = ( (1 / Σ M_j) Σ ( M_j Z^s_j / (Z'_j + ε) ) ) Z'_j , (6)

where ε is a hyper-parameter to avoid zero division.
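A minimal PyTorch rendering of Eq. (6) is given below: the prediction is rescaled so that it matches, on a weighted average, the sparse SfM depths at the valid locations. Tensor shapes and names are assumptions for illustration, not the released layer.

```python
import torch

def depth_scaling_layer(depth_pred, sparse_depth, sparse_mask, eps=1.0e-8):
    """Scale a depth prediction to the SfM scale (Eq. 6).

    depth_pred:   (B, 1, H, W) network output Z'_j, correct up to a scale.
    sparse_depth: (B, 1, H, W) sparse depth map Z^s_j (zero where no SfM point projects).
    sparse_mask:  (B, 1, H, W) sparse soft mask M_j.
    """
    # Weighted average of the ratio between sparse SfM depth and predicted depth.
    ratio = sparse_mask * sparse_depth / (depth_pred + eps)
    # The extra eps in the denominator guards against frames without any projected points.
    scale = ratio.sum(dim=(1, 2, 3), keepdim=True) / (sparse_mask.sum(dim=(1, 2, 3), keepdim=True) + eps)
    return scale * depth_pred
```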

Fig. 2. Training data generation pipeline. The pipeline is able to generate training data from video sequences automatically. The symbols in the figure are defined in the Training Data section. The green dots shown in the figure stand for example projected 2D locations of the sparse reconstruction. These projected 2D locations are used to store valid information for all the sparse-form data, i.e., sparse depth map, sparse soft mask, and sparse flow map. A sparse depth map stores z-axis distances of the sparse reconstruction w.r.t. the camera coordinate. A sparse soft mask stores soft weights which indicate the confidence of individual points in the sparse reconstruction. A sparse flow map stores movement of projection locations of the sparse reconstruction between two frames. The generation of a sparse depth map and sparse flow map is shown in the second row of the figure, where two example projected locations are used to demonstrate the concept. The cyan dashed arrows indicate point correspondences between two frames. Note that the sparse-form data do not include the color information of the videos, which is used only to help with the visualization in the figure.

Flow from Depth Layer. To use the sparse flow map generated from SfM results to guide network training with the SFL described later, the scaled depth map first needs to be converted to a dense flow map with the relative camera poses and the intrinsic matrix. This layer is similar to the one proposed in [15], where they use the produced dense flow map as the input to an optical flow estimation network. Here, instead, we use it for the depth estimation training. The dense flow map is essentially a 2D displacement field describing a 3D viewpoint change. Given the scaled depth map of frame j and the relative camera pose of frame k w.r.t. frame j, T_{kj} = (R_{kj}, t_{kj}), a dense flow map between frames j and k, F_{j,k}, can be derived. To describe the operations in a parallelizable and differentiable way, the equations below are written in matrix form. The 2D locations in frame j, (U, V), are organized as a regular 2D meshgrid. The corresponding 2D locations of frame k are (U_k, V_k), which are organized in the same spatial arrangement as frame j. (U_k, V_k) is given by

U_k = ( Z_j (A_{0,0} U + A_{0,1} V + A_{0,2}) + B_{0,0} ) / ( Z_j (A_{2,0} U + A_{2,1} V + A_{2,2}) + B_{2,0} )
V_k = ( Z_j (A_{1,0} U + A_{1,1} V + A_{1,2}) + B_{1,0} ) / ( Z_j (A_{2,0} U + A_{2,1} V + A_{2,2}) + B_{2,0} ) . (7)

As a regular meshgrid, U consists of H rows of [0, 1, ..., W-1], and V consists of W columns of [0, 1, ..., H-1]^T. A = K R_{kj} K^{-1} and B = -K t_{kj}. A_{m,n} and B_{m,n} are elements of A and B at position (m, n), respectively. The dense flow map, F_{j,k}, describing the 2D displacement field from frame j to frame k is

F_{j,k} = ( (U_k - U) / W , (V_k - V) / H ) . (8)
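Eqs. (7) and (8) translate directly into batched tensor operations. The sketch below follows the sign convention of Eq. (7) as stated; it is an illustrative implementation with assumed tensor shapes, not the released layer.

```python
import torch

def flow_from_depth_layer(Z_j, K, R_kj, t_kj):
    """Dense flow map F_{j,k} induced by the scaled depth Z_j and a relative pose (Eqs. 7-8).

    Z_j:  (B, 1, H, W) scaled depth prediction of frame j.
    K:    (B, 3, 3) camera intrinsics.
    R_kj: (B, 3, 3) rotation and t_kj: (B, 3, 1) translation of frame k w.r.t. frame j.
    """
    B_, _, H, W = Z_j.shape
    A = K @ R_kj @ torch.inverse(K)          # (B, 3, 3)
    Bv = -(K @ t_kj)                         # (B, 3, 1), sign convention as written in Eq. (7)

    # Regular 2D meshgrid of pixel coordinates (U: columns, V: rows).
    V, U = torch.meshgrid(torch.arange(H, dtype=Z_j.dtype),
                          torch.arange(W, dtype=Z_j.dtype), indexing="ij")
    U = U.to(Z_j.device).expand(B_, H, W)
    V = V.to(Z_j.device).expand(B_, H, W)
    Z = Z_j[:, 0]

    def row(i):
        # Evaluate row i of Eq. (7): Z_j (A_{i,0} U + A_{i,1} V + A_{i,2}) + B_{i,0}.
        a = A[:, i].view(B_, 3, 1, 1)
        return Z * (a[:, 0] * U + a[:, 1] * V + a[:, 2]) + Bv[:, i].view(B_, 1, 1)

    denom = row(2)
    U_k, V_k = row(0) / denom, row(1) / denom
    # Eq. (8): normalized 2D displacement field from frame j to frame k.
    flow = torch.stack([(U_k - U) / W, (V_k - V) / H], dim=1)  # (B, 2, H, W)
    return flow, U_k, V_k
```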

Depth Warping Layer. The sparse flow map mainly provides guidance to regions of a frame onto which sparse information from SfM is projected. Given that most frames only have a small percentage of pixels with valid values in a sparse flow map, most regions are still not properly guided. With the camera motion and camera intrinsics, geometric constraints between two frames can be exploited by enforcing consistency between the two corresponding depth predictions. The intuition is that the dense depth maps predicted separately from two neighboring frames are correlated because there is overlap between the observed regions. To make the geometric constraints enforced in the Depth Consistency Loss described later differentiable, the viewpoints of the depth predictions must be aligned first. Because a dense flow map describes a 2D projected movement of the observed 3D scene, the U_k and V_k described above can be used to change the viewpoint of the depth Z_k from frame k to frame j, with an additional step that modifies Z_k to account for the depth value changes due to the viewpoint change. The modified depth map of frame k, Z̃_k, is

Z̃_k = Z_k (C_{2,0} U + C_{2,1} V + C_{2,2}) + D_{2,0} , (9)

where C = K R_{jk} K^{-1} and D = K t_{jk}. With U_k, V_k, and Z̃_k, the bilinear sampler in [21] is able to generate the dense depth map Z_{k,j} that is warped from the viewpoint of frame k to that of frame j.
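The warping itself can be sketched with Eq. (9) followed by bilinear sampling, for which PyTorch's grid_sample is a natural fit. The sketch reuses U_k and V_k from the flow_from_depth_layer example above and is, again, only an illustration of the idea, not the released layer.

```python
import torch
import torch.nn.functional as F

def depth_warping_layer(Z_k, U_k, V_k, K, R_jk, t_jk):
    """Warp the scaled depth of frame k to the viewpoint of frame j (Eq. 9 plus bilinear sampling).

    Z_k:      (B, 1, H, W) scaled depth prediction of frame k.
    U_k, V_k: (B, H, W) frame-k pixel locations of frame-j pixels (from flow_from_depth_layer).
    K:        (B, 3, 3) intrinsics; R_jk: (B, 3, 3), t_jk: (B, 3, 1) pose of frame j w.r.t. frame k.
    """
    B_, _, H, W = Z_k.shape
    C = K @ R_jk @ torch.inverse(K)
    D = K @ t_jk

    # Meshgrid over frame k, used to evaluate the third row of Eq. (9).
    V, U = torch.meshgrid(torch.arange(H, dtype=Z_k.dtype),
                          torch.arange(W, dtype=Z_k.dtype), indexing="ij")
    U = U.to(Z_k.device).expand(B_, H, W)
    V = V.to(Z_k.device).expand(B_, H, W)
    c = C[:, 2].view(B_, 3, 1, 1)
    Z_k_mod = Z_k[:, 0] * (c[:, 0] * U + c[:, 1] * V + c[:, 2]) + D[:, 2].view(B_, 1, 1)

    # Bilinear sampling at (U_k, V_k): grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack([2.0 * U_k / (W - 1) - 1.0, 2.0 * V_k / (H - 1) - 1.0], dim=-1)
    Z_kj = F.grid_sample(Z_k_mod.unsqueeze(1), grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
    return Z_kj  # (B, 1, H, W): depth of frame k seen from the viewpoint of frame j
```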

C. Loss Functions

We propose novel losses that can exploit self-supervisory signals from SfM and enforce geometric consistency between depth predictions of two frames.

Sparse Flow Loss (SFL). To produce correct dense depth maps that agree with sparse reconstructions from SfM, the network is trained to minimize the differences between the dense flow maps and the corresponding sparse flow maps. This loss is scale-invariant because it considers the difference of the 2D projected movement in units of pixels, which solves the data imbalance problem caused by the arbitrary scales of SfM results. The SFL associated with frames j and k is calculated as

L_flow(j, k) = (1 / Σ M_j) Σ ( M_j | F^s_{j,k} - F_{j,k} | ) + (1 / Σ M_k) Σ ( M_k | F^s_{k,j} - F_{k,j} | ) . (10)
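In code, Eq. (10) is a masked, normalized L1 difference between dense and sparse flow maps; a PyTorch sketch with illustrative names follows.

```python
import torch

def sparse_flow_loss(flow_jk, flow_kj, sparse_flow_jk, sparse_flow_kj,
                     mask_j, mask_k, eps=1.0e-8):
    """Sparse Flow Loss (Eq. 10): masked L1 between dense and sparse flow maps.

    flow_*:        (B, 2, H, W) dense flow maps from the Flow from Depth Layer.
    sparse_flow_*: (B, 2, H, W) sparse flow maps from SfM.
    mask_j/k:      (B, 1, H, W) sparse soft masks; eps guards against empty masks.
    """
    term_jk = (mask_j * (sparse_flow_jk - flow_jk).abs()).sum() / (mask_j.sum() + eps)
    term_kj = (mask_k * (sparse_flow_kj - flow_kj).abs()).sum() / (mask_k.sum() + eps)
    return term_jk + term_kj
```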


Depth Consistency Loss (DCL). Sparse signals from the SFL alone cannot provide enough information to enable the network to reason about regions where no sparse annotations are available. Therefore, we enforce geometric constraints between two independently predicted depth maps. The DCL associated with frames j and k is calculated as

L_consist(j, k) = Σ ( W_{j,k} (Z_j - Z_{k,j})^2 ) / Σ ( W_{j,k} (Z_j^2 + Z_{k,j}^2) ) + Σ ( W_{k,j} (Z_k - Z_{j,k})^2 ) / Σ ( W_{k,j} (Z_k^2 + Z_{j,k}^2) ) , (11)

where W_{j,k} is the intersection of the valid regions of Z_j and the dense depth map Z_{k,j} that is predicted from frame k but warped to the viewpoint of frame j. Because SfM results contain arbitrary global scales, this loss only penalizes the relative difference between two dense depth maps to avoid data imbalance.

Overall Loss. The overall loss function for network training with a single pair of training data from frames j and k is

L(j, k) = λ_1 L_flow(j, k) + λ_2 L_consist(j, k) . (12)
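A compact PyTorch sketch of Eqs. (11) and (12) is given below; the default weights follow the schedule reported in Section III, and all names are illustrative rather than those of the released code.

```python
import torch

def depth_consistency_loss(Z_j, Z_k, Z_kj, Z_jk, W_jk, W_kj, eps=1.0e-8):
    """Depth Consistency Loss (Eq. 11): relative difference between aligned depth pairs.

    Z_j, Z_k:   (B, 1, H, W) scaled depth predictions of frames j and k.
    Z_kj, Z_jk: (B, 1, H, W) depths warped to the other viewpoint by the Depth Warping Layer.
    W_jk, W_kj: (B, 1, H, W) intersections of the valid regions of each pair.
    """
    term_j = (W_jk * (Z_j - Z_kj) ** 2).sum() / ((W_jk * (Z_j ** 2 + Z_kj ** 2)).sum() + eps)
    term_k = (W_kj * (Z_k - Z_jk) ** 2).sum() / ((W_kj * (Z_k ** 2 + Z_jk ** 2)).sum() + eps)
    return term_j + term_k

def overall_loss(loss_flow, loss_consist, lambda_1=20.0, lambda_2=5.0):
    """Overall training loss (Eq. 12); lambda_2 is set to 0.1 during the first 20 epochs."""
    return lambda_1 * loss_flow + lambda_2 * loss_consist
```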

III. EXPERIMENT AND RESULTS

A. Experiment Setup

All experiments are conducted on a workstation with 4 NVIDIA Tesla M60 GPUs, each with 8 GB memory. The method is implemented using PyTorch [23]. The dataset contains 10 rectified sinus endoscopy videos acquired with different endoscopes. The videos were collected from 8 anonymized and consenting patients and from 2 cadavers under an IRB-approved protocol. The overall duration of the videos is approximately 30 minutes. In all leave-one-out experiments below, the data from 7 out of the 8 patients are used for training. The data from the 2 cadavers are used for validation, and the left-out patient is used for testing.

We select trained models for evaluation based on the network loss on the validation dataset. Overall, two types of evaluation are conducted. One compares point clouds converted from depth predictions with the corresponding surface models from CT data. The other directly compares depth predictions with the corresponding sparse depth maps generated from SfM results.

For the evaluation related to CT data, we pick 20 frames with sufficient anatomical variation per testing patient. The depth predictions are converted to point clouds. The initial global scales and poses of the point clouds before registration are manually estimated. To this end, we pick the same set of anatomical landmarks in both the point cloud and the corresponding CT surface model. 3000 uniformly sampled points from each point cloud are registered to the corresponding surface models generated from the patient CT scans [24] using the Iterative Most Likely Oriented Point (IMLOP) algorithm [25]. We modify the registration algorithm to estimate a similarity transform with a hard constraint during optimization. The constraint prevents the point cloud from deviating too much from the initial alignment, given that the initial alignments are approximately correct. The residual error is defined as the average Euclidean distance over all closest point pairs between the registered point cloud and the surface model. The average residual errors over all point clouds are used as the accuracy estimate of the depth predictions.

For the evaluation related to SfM, all video frames of the testing patient for which a valid camera pose is estimated by SfM are used. Sparse depth maps are first generated from the SfM results. For a fair comparison, all depth predictions are first re-scaled with the corresponding sparse depth maps using the Depth Scaling Layer to match the scale of the depth predictions and the SfM results. Because of the scale ambiguity of the SfM results, we only use common scale-invariant metrics for evaluation. The metrics are the Absolute Relative Difference, defined as (1/|T|) Σ_{y∈T} |y - y*| / y*, and the Threshold, defined as the percentage of y such that max(y_i / y*_i, y*_i / y_i) < σ, with three different σ, namely 1.25, 1.25^2, and 1.25^3 [15]. The metrics are only evaluated at the valid positions in the sparse depth maps and the corresponding locations in the depth predictions.
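Both metrics can be computed in a few lines once predictions and sparse SfM depths are gathered at the valid locations; the NumPy sketch below is illustrative only.

```python
import numpy as np

def scale_invariant_metrics(pred, gt):
    """Absolute relative difference and threshold accuracies used in Table I.

    pred, gt: 1D arrays of predicted and SfM depths at the valid sparse locations,
              with pred already re-scaled by the Depth Scaling Layer.
    """
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    thresholds = {f"delta < 1.25^{p}": np.mean(ratio < 1.25 ** p) for p in (1, 2, 3)}
    return abs_rel, thresholds
```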

In terms of the sparsity of the reconstructions from SfM, the number of points per sparse reconstruction is 4687 (±6276). After smoothing out the point visibility information from SfM, the number of projected points per image from the sparse reconstruction is 1518 (±1280). Given the downsampled image resolution, this means that 1.85 (±1.56)% of pixels in the sparse-form data have valid information. In the training and application phases, all images extracted from the videos are cropped to remove the invalid blank regions and downsampled to a resolution of 256×320. The range for smoothing the point visibility information in the Data Preprocessing section is set to 30. The frame interval between two frames that are randomly selected from the same sequence and fed to the two-branch training network is set to [5, 30]. We use extensive data augmentation during the experiments to make the training data distribution as unbiased as possible with respect to specific patients or cameras, e.g., random brightness, random contrast, random gamma, random HSV shift, Gaussian blur, motion blur, JPEG compression, and Gaussian noise. During network training, we use Stochastic Gradient Descent (SGD) optimization with momentum set to 0.9 and a cyclical learning rate scheduler [26] with learning rate ranging from 1.0e-4 to 1.0e-3. The batch size is set to 8. The σ for generating the sparse soft masks is set to the average track length of points in the sparse reconstructions from SfM. The ε in the Depth Scaling Layer is set to 1.0e-8. We train the network for 80 epochs in total. λ_1 is always 20.0. For the first 20 epochs, λ_2 is set to 0.1 to mainly use the SFL for initial convergence. For the remaining 60 epochs, λ_2 is set to 5.0 to add more geometric constraints and fine-tune the network.
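For reference, the optimizer, learning-rate schedule, and loss-weight schedule described above can be configured roughly as follows; this is a hypothetical sketch, not the released training script.

```python
import torch

def configure_training(model, base_lr=1.0e-4, max_lr=1.0e-3, momentum=0.9):
    """SGD with momentum and a cyclical learning-rate schedule, as described in Section III."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=momentum)
    # CyclicLR is stepped once per batch after optimizer.step().
    scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=base_lr, max_lr=max_lr)
    return optimizer, scheduler

def loss_weights(epoch):
    """Loss weighting schedule: SFL-dominated warm-up, then stronger geometric constraints."""
    lambda_1 = 20.0
    lambda_2 = 0.1 if epoch < 20 else 5.0
    return lambda_1, lambda_2
```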

B. Cross-patient Study

To show the generalizability of our method, we conduct 4 leave-one-out experiments where we leave out Patient 2, 3, 4, and 5, respectively, during training for evaluation. Data from other patients are not used for evaluation for the lack of corresponding CT scans.


Fig. 3. Qualitative result comparison between our method, Zhou et al. [14], and Yin et al. [15]. The first column consists of testing and training images, where the first 3 are testing ones. The second and third columns consist of corresponding depth maps and reconstructions from our method. The fourth and fifth columns are from Zhou et al. The last two columns are from Yin et al. For each displayed video frame, a sparse depth map is used to re-scale the depth predictions from the three methods. The scaled depth predictions are then normalized with the same max depth values for 2D visualization, where the same depth color coding as in Fig. 1 is used. The point clouds converted from the depth predictions are post-processed by a standard Poisson surface reconstruction method [22] for 3D visualization. It shows that our method performs consistently better than Zhou et al. and Yin et al. in both testing and training cases.

The quantitative evaluation results in Fig. 4 (a) show that our method achieves submillimeter residual errors for all testing reconstructions. The average residual error over testing frames from all 4 testing patients is 0.40 (±0.18) mm. For a better understanding of the accuracy of the reconstructions, the average residual error reported by Leonard et al. [1], where the same SfM algorithm that we use to generate training data is evaluated, is 0.32 (±0.28) mm. We use the same clinical data for evaluation as theirs in this work. Therefore, our method achieves performance comparable to the SfM algorithm [1], even though our reconstructions are estimated from single views.

C. Comparison Study

We conduct a comparison study to evaluate the performance of our method against two typical self-supervised depth estimation methods [14], [15]. We use the original implementations of both methods with a slight modification, where we omit the black invalid regions of endoscopy images when computing losses during training. In Fig. 3, we show representative qualitative results for all three methods. In Fig. 5, we overlay the CT surface model with the registered point clouds of one video frame from all three methods. We also compare our method with these methods quantitatively. Table I, where the evaluation related to SfM is used, shows evaluation results of depth predictions from all three methods, revealing that our method outperforms both competing approaches by a large margin. Note that all video frames from Patient 2, 3, 4, and 5 are used for evaluation. For this evaluation, all four trained models in the Cross-patient Study are used to generate depth predictions for each corresponding testing patient to test the performance of our method. For Zhou et al. and Yin et al., the evaluation model sees all patient data except Patient 4 during training. Therefore, the comparison is in favor of the competing methods. The poor performance of the competing methods on the training and testing datasets shows that it is not overfitting that makes their performance worse than ours. Instead, these two methods cannot make the network exploit signals in the unlabeled endoscopy data effectively. The boxplot in Fig. 4 (b) shows the comparison results with the CT surface models. For ease of experimentation, only the data from Patient 4 are used for this evaluation. The average residual error of our reconstructions is 0.38 (±0.13) mm. For Zhou et al., it is 1.77 (±1.19) mm. For Yin et al., it is 0.94 (±0.36) mm. The extreme outliers of the reconstructions from Zhou et al. are removed before error calculation.

We believe the main reason for the inferior performance of the two comparison methods lies in the choice of the main driving signal used to achieve self-supervised depth estimation. Zhou et al. choose an L1 loss to enforce photometric consistency between two frames. This assumes that the appearance of a region does not change when the viewpoint changes, which is not the case in monocular endoscopy, where the lighting source moves jointly with the camera. Yin et al. use a weighted average of a Structural Similarity (SSIM) loss and an L1 loss. SSIM is less susceptible to brightness changes and pays attention to textural differences. However, since only simple statistics of an image patch are used to represent texture in SSIM, the expressivity is not enough for cases with scarce and homogeneous texture, such as sinus endoscopy and colonoscopy, to avoid bad local minima during training. This is especially true for the tissue walls present in sinus endoscopy, where we observe erroneous depth predictions.


Fig. 4. (a) Boxplot of residual errors for the cross-patient study. The IDs of the testing patients are used as labels on the horizontal axis. All testing reconstructions have submillimeter residual errors. (b) Boxplot of residual errors for the comparison study and ablation study. We compare our method with Zhou et al. [14] and Yin et al. [15] quantitatively using data from Patient 4 for testing. The differences between the residual errors of our method and the other two methods are statistically significant (p < .001). For the ablation study, a model is trained with SFL only to compare with the model trained with both SFL and DCL.

Fig. 5. Reconstructions registered to patient CT. The alignment produced between our reconstruction and the corresponding patient CT (left) shows that our reconstruction adheres well to the contours of the patient CT and contains few outliers, whereas the alignments between the reconstructions from Zhou et al. (middle) and Yin et al. (right) for the same frame and the corresponding patient CT show poor alignment and many outliers. Many points of the reconstructions by Zhou et al. and Yin et al. fall inside regions where the endoscope cannot enter.


TABLE I
EVALUATION WITH SFM RESULTS*

Method              Absolute rel. diff.   Threshold
                                          σ = 1.25   σ = 1.25^2   σ = 1.25^3
Ours                      0.20              0.75        0.93         0.98
Zhou et al. [14]          0.66              0.41        0.68         0.83
Yin et al. [15]           0.41              0.54        0.78         0.89

* The model performance on data from Patient 2, 3, 4, and 5 is evaluated with two metrics, Absolute Relative Difference and Threshold [15]. The sparse depth maps generated from SfM results are used as ground truth. The models of our method used for evaluation are those from the cross-patient study, which means the data from all four patients are not seen during training. On the other hand, the models of Zhou et al. and Yin et al. have seen data from Patient 2, 3, and 5 during training.

Fig. 6. Qualitative results for the ablation study. The results consist of training and testing images, where the first 2 images are seen during training. The second and third columns consist of corresponding depth maps and reconstructions from the model trained with only SFL. The fourth and fifth columns are from the model trained with both SFL and DCL. The result shows that DCL helps in both training and testing cases. It provides additional guidance to regions where sparse reconstructions from SfM are either inaccurate, e.g., regions with specularity in the first row, or missing, e.g., regions near the boundary in the second and third rows.

D. Ablation Study

To evaluate the effect of the loss components, i.e., SFL and DCL, a network is trained with only SFL, with Patient 4 held out for testing. The model trained in the Cross-patient Study with Patient 4 for testing is used for comparison. Since DCL alone is not able to train a model with meaningful results, we do not evaluate its performance alone. The qualitative (Fig. 6) and quantitative (Fig. 4 (b)) results show that the model trained with SFL and DCL combined performs better than the model trained with SFL only. In terms of the evaluation results on data from Patient 4, the average residual error for the model trained with SFL only is 0.47 (±0.10) mm. In terms of the evaluation related to SfM, the values of the metrics, i.e., absolute relative difference and the threshold test with σ = 1.25, 1.25^2, and 1.25^3, are 0.14, 0.81, 0.98, and 1.00, respectively. In comparison, the average residual error for the model trained with SFL and DCL is 0.38 (±0.13) mm, and the values of the same metrics are 0.13, 0.85, 0.98, and 1.00, respectively, which shows a slight improvement compared with the model trained with SFL only. Note that the sparse depth maps are unevenly distributed and there are usually few valid points for evaluation on the tissue wall, which is where DCL is observed to help most. Therefore, the observed improvement in the evaluation related to SfM is not as large as that of the average residual error in the evaluation related to CT data.


IV. DISCUSSION

The proposed method does not require any labeled data for training and generalizes well across endoscopes and patients. The method was initially designed for and evaluated on sinus endoscopy data; however, we are confident that it is also applicable to monocular endoscopy of other anatomies. Nevertheless, some limitations of our method remain and need to be addressed in future work. First, the training phase of our method relies on the reconstructions and camera poses from SfM. On the one hand, this means our method will evolve and improve as more advanced SfM algorithms become available. On the other hand, it means our method does not apply to cases where SfM is not able to produce reasonable results. While our method tolerates random errors and outliers from SfM to a certain extent, it will likely fail if large systematic errors occur in a large portion of the data, which could happen in highly dynamic environments. Second, our method only produces dense depth maps up to a global scale. In scenarios where the global scale is required, additional information needs to be provided during the application phase to recover the global scale. This can be achieved, e.g., by measuring known-size objects or using external tracking devices. In terms of the inter-frame geometric constraints, concurrent to our work, a 3D ICP loss was proposed in [16] to enforce geometric consistency of two depth predictions. Because the Iterative Closest Point (ICP) algorithm used in their loss calculation is not differentiable, they use the residual errors of the point cloud registration upon convergence as an approximation of the difference between two depth predictions. There are two advantages of the proposed DCL over the 3D ICP loss. First, it is able to handle errors between two depth predictions that can be compensated by a rigid transformation. Second, it does not involve a registration method, which can potentially introduce erroneous information for training when a registration failure happens. Because the implementation of the 3D ICP loss is not released, no comparison is made in this work. Recently, a similar geometric consistency loss [27] has been proposed, which is subsequent to our work [5]. In terms of the evaluation, the average residual error reported in the evaluation related to CT data can lead to underestimated errors. This is because the residual error is calculated using pairs of closest points between the registered point clouds and the CT surface models. Since the distance between a closest point pair is always less than or equal to the distance between the true point pair, the overall error will be underestimated. Depending on the accuracy of SfM, the evaluation related to SfM may better represent the true accuracy for regions of the depth predictions that have valid correspondences in the sparse depth maps. However, this metric has the disadvantage that regions where no valid correspondences exist in the sparse depth maps are not evaluated. An exact accuracy estimate is available only if the camera trajectory of a video is accurately registered to the CT surface model, which we currently do not have and will work on as a future direction.

V. CONCLUSION

In this work, we present a self-supervised approach to training convolutional neural networks for dense depth estimation in monocular endoscopy without any a priori modeling of anatomy or shading. To the best of our knowledge, this is the first deep learning-based self-supervised depth estimation method proposed for monocular endoscopy. Our method only requires monocular endoscopic videos and a multi-view stereo method during the training phase. In contrast to most competing methods for self-supervised depth estimation, our method does not assume photometric constancy, making it applicable to endoscopy. In a cross-patient study, we demonstrate that our method generalizes well to different patients, achieving submillimeter residual errors even when trained on small amounts of unlabeled training data from several other patients. In a comparison study, we show that our method outperforms two recent self-supervised depth estimation methods by a large margin on in vivo sinus endoscopy data. For future work, we plan to fuse depth maps from single frames into an entire 3D model to make the method more suitable for applications such as clinical anatomical studies and surgical navigation.

REFERENCES

[1] S. Leonard, A. Sinha, A. Reiter, M. Ishii, G. L. Gallia, R. H. Taylor, et al., “Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data,” IEEE Trans. Med. Imag., vol. 37, no. 10, pp. 2185–2195, Oct. 2018.

[2] A. Sinha, X. Liu, A. Reiter, M. Ishii, G. D. Hager, and R. H. Taylor, “Endoscopic navigation in the absence of CT imaging,” in Med. Image Comput. Comput. Assist. Interv. MICCAI 2018 - 21st Int. Conf., 2018, Proc. Cham: Springer International Publishing, 2018, pp. 64–71.

[3] H. Suenaga, H. H. Tran, H. Liao, K. Masamune, T. Dohi, K. Hoshi, et al., “Vision-based markerless registration using stereo vision and an augmented reality surgical navigation system: a pilot study,” BMC Med. Imaging, vol. 15, no. 1, p. 51, 2015.

[4] L. Yang, J. Wang, T. Ando, A. Kubota, H. Yamashita, I. Sakuma, et al., “Vision-based endoscope tracking for 3D ultrasound image-guided surgical navigation,” Comput. Med. Imaging Graph., vol. 40, pp. 205–216, 2015.

[5] X. Liu, A. Sinha, M. Unberath, M. Ishii, G. Hager, R. Taylor, et al., “Self-supervised learning for dense depth estimation in monocular endoscopy,” in OR 2.0 Context Aware Oper. Theaters Comput. Assist. Robot. Endosc. Clin. Image Based Proced. Skin Image Anal. Springer Verlag, 2018, pp. 128–138.

[6] O. G. Grasa, E. Bernal, S. Casado, I. Gil, and J. Montiel, “Visual SLAM for handheld monocular endoscope,” IEEE Trans. Med. Imag., vol. 33, no. 1, pp. 135–146, 2014.

[7] N. Mahmoud, A. Hostettler, T. Collins, L. Soler, C. Doignon, and J. Montiel, “SLAM based quasi dense reconstruction for minimally invasive surgery scenes,” arXiv preprint arXiv:1705.09107, 2017.

[8] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 4th Int. Conf. 3D Vis., Oct. 2016, pp. 239–248.

[9] M. Visentini-Scarzanella, T. Sugiura, T. Kaneko, and S. Koto, “Deep monocular 3D reconstruction for assisted navigation in bronchoscopy,” Int. J. Comput. Assist. Radiol. Surg., vol. 12, no. 7, pp. 1089–1099, 2017.

[10] F. Mahmood and N. J. Durr, “Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy,” Med. Image Anal., vol. 48, pp. 230–243, 2018.

[11] S.-P. Yang, J.-J. Kim, K.-W. Jang, W.-K. Song, and K.-H. Jeong, “Compact stereo endoscopic camera using microprism arrays,” Opt. Lett., vol. 41, no. 6, pp. 1285–1288, 2016.

[12] M. Simi, M. Silvestri, C. Cavallotti, M. Vatteroni, P. Valdastri, A. Menciassi, et al., “Magnetically activated stereoscopic vision system for laparoendoscopic single-site surgery,” IEEE/ASME Trans. Mechatronics, vol. 18, no. 3, pp. 1140–1151, 2013.

[13] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in Comput. Vis. ECCV. Springer, 2016, pp. 740–756.

[14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in 2017 IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 6612–6619.


[15] Z. Yin and J. Shi, “GeoNet: Unsupervised learning of dense depth, optical flow and camera pose,” in 2018 IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1983–1992.

[16] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5667–5675.

[17] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Adv. Neural Inf. Process. Syst. 27. Curran Associates, Inc., 2014, pp. 2366–2374.

[18] S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation,” in 2017 Conf. Comput. Vis. Pattern Recognit. Workshops. IEEE, 2017, pp. 1175–1183.

[19] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. 2005 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1. Washington, DC, USA: IEEE Computer Society, 2005, pp. 539–546. [Online]. Available: https://doi.org/10.1109/CVPR.2005.202

[20] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard artifacts,” Distill, vol. 1, no. 10, p. e3, 2016.

[21] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., vol. 2. Cambridge, MA, USA: MIT Press, 2015, pp. 2017–2025. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969442.2969465

[22] M. Kazhdan and H. Hoppe, “Screened Poisson surface reconstruction,” ACM Trans. Graph., vol. 32, no. 3, pp. 29:1–29:13, Jul. 2013. [Online]. Available: http://doi.acm.org/10.1145/2487228.2487237

[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, et al., “Automatic differentiation in PyTorch,” in NIPS 2017 Autodiff Workshop, 2017.

[24] A. Sinha, A. Reiter, S. Leonard, M. Ishii, G. D. Hager, and R. H. Taylor, “Simultaneous segmentation and correspondence improvement using statistical modes,” in Med. Imag. 2017: Image Process., M. A. Styner and E. D. Angelini, Eds., vol. 10133, International Society for Optics and Photonics. SPIE, 2017, pp. 377–384. [Online]. Available: https://doi.org/10.1117/12.2253533

[25] S. Billings and R. Taylor, “Iterative most likely oriented point registration,” in Med. Image Comput. Comput. Assist. Interv. Cham: Springer International Publishing, 2014, pp. 178–185.

[26] L. N. Smith, “Cyclical learning rates for training neural networks,” in 2017 IEEE Winter Conf. Appl. Comput. Vis., Mar. 2017, pp. 464–472.

[27] J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, et al., “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” arXiv preprint arXiv:1908.10553, 2019.