Simultaneous Video Defogging and Stereo Reconstruction

Zhuwen Li1, Ping Tan2, Robby T. Tan3, Danping Zou4∗, Steven Zhiying Zhou1,5 and Loong-Fah Cheong1

∗D. Zou worked on this project as a research fellow with P. Tan.

1 National University of Singapore, 2 Simon Fraser University, 3 SIM University, 4 Shanghai Jiao Tong University, 5 National University of Singapore (Suzhou) Research Institute

Abstract

We present a method to jointly estimate scene depth and recover the clear latent image from a foggy video sequence. In our formulation, the depth cues from stereo matching and fog information reinforce each other and produce superior results to conventional stereo or defogging algorithms. We first improve the photo-consistency term to explicitly model the appearance change due to scattering effects. The matting Laplacian prior on fog transmission imposes a detail-preserving smoothness constraint on the scene depth. We further enforce the ordering consistency between scene depth and fog transmission at neighboring points. These novel constraints are formulated together in an MRF framework, which is optimized iteratively by introducing auxiliary variables. Experimental results on real videos demonstrate the strength of our method.

1. Introduction

Multi-view stereo has received intensive investigation [32].

Most of these methods, however, are designed for images captured in clear scenes, and consequently foggy or underwater images present a significant challenge. One reason is that these images suffer from poor contrast, which makes image features less distinctive and confuses stereo matching. Another reason is that the photo-consistency measure in conventional stereo algorithms does not account for scattering and absorption during light propagation, and thus generates systematic matching errors.

While fog poses challenges for stereo algorithms, it also brings compensatory advantages. From a computational viewpoint, it is well known [16] that stereo does not work well at large distances. The depth smoothness prior further results in loss of surface details [42] such as thin elongated structures and holes. Fog transmission information (i.e. the α-channel) contains depth cues that are qualitatively different, because: 1) fog transmission provides depth ordering, since thicker fog is associated with larger distance; 2) fog transmission satisfies the matting Laplacian


[21], which provides a detail-preserving smoothness prior on scene depth.

At the same time, stereo vision helps defogging. Previous defogging techniques are mainly designed to defog a single image, which is essentially an ill-posed problem due to the airlight-albedo ambiguity [10]. Recent techniques rely on additional heuristic assumptions, such as piecewise constant albedo [10], maximum local contrast [35], or the dark channel prior [17]. But all these assumptions can be fooled by the airlight-albedo ambiguity. As a result, nearby objects with unsaturated color are often mistaken for faraway objects with saturated color. It has been demonstrated [19] that even rough depth information can significantly improve defogging performance. A precise depth estimate from stereo therefore helps to reduce the airlight-albedo ambiguity in defogging.

A naive solution for simultaneous stereo and defogging is to apply both algorithms in turn, e.g. defog all images and then apply stereo vision to the defogged results. However, most existing defogging algorithms are designed for a single image and generate temporally inconsistent results when applied in a frame-by-frame manner. A subsequent stereo algorithm then produces large errors on these inconsistently defogged images. This problem is shown in our experiments.

In this paper, we study the stereo vision and defogging problems jointly, and design an algorithm that simultaneously estimates scene depth and defogs the input images. Our method is based on the observation that the depth cues from stereo and fog thickness are complementary to each other (i.e. stereo cues are more reliable for nearby objects and fog thickness cues are more reliable for faraway objects). Our method performs best on scenes with thick fog or large camera movement, where both depth cues are strong at close and far range. This feature makes our algorithm especially suitable for applications in autonomous navigation of unmanned vehicles in bad weather or underwater, such as in [23, 29].

Our method includes four key features. Firstly, we improve the photo-consistency term in stereo matching to incorporate the scattering effect. When evaluating the consistency of two pixels from different viewpoints, we explicitly model their appearance change due to fog.


Secondly, we compute the fog transmission at each pixel directly from the scene depth (and the estimated fog density). This ensures that the results from our stereo and defogging are consistent with each other, relieves the airlight-albedo ambiguity in defogging, and maintains temporal consistency in the final defogged video. Thirdly, we incorporate a strong prior on fog transmission into the joint stereo-defogging formulation. Specifically, we impose the matting Laplacian constraint [21] on the scene depth, since the fog transmission can be directly computed from depth. As we will see from the experiments, this constraint helps to capture fine details in the depth map. Lastly, we also incorporate a pairwise depth ordering constraint, leveraging the reliability of fog transmission in conveying ordinal depth information.

2. Related Work

Many multi-view stereo (MVS) algorithms have been proposed, and we summarize a few of the most recent and related works here. For a more detailed discussion of various stereo algorithms, we refer readers to the excellent survey [32], and for an evaluation of the recent top performing multi-view stereo methods, to [1]. In particular, the recent works most related to ours are those based on the multiple depth map approach [5, 6, 12, 13, 15, 34, 43].

A significant advantage that MVS algorithms have over two-view stereo is their ability to reason in a principled manner about visibility and occlusions. MVS algorithms can also impose consistency between disparity estimates at different frames to reduce sensitivity to outliers in individual frames. Similar to the work of [43], we add a geometric coherence term to the conventional photoconsistency data term to impose all the preceding geometric constraints. Despite the improved results, all these energy functions, when minimized by global optimization techniques such as the multi-label graph-cut method [4], still suffer from characteristic artifacts. The resulting depth maps typically exhibit shrinking bias (shortcutting of the segmentation boundary due to a bias toward shorter boundaries). Irregular or thin objects such as trees, bushes, branches, and fences are often poorly reconstructed or even completely missed (see the examples in [13], or the flower sequence in [43]). The reconstruction results also depend on the evolving quality of the current depth estimate, which in the case of a foggy or underwater scene might be too poor to yield good results. Our work not only incorporates the scattering effect to improve stereo matching, it also makes use of the fog transmission information to preserve details in depth maps.

A number of single-image dehazing or defogging methods have been proposed [2, 10, 17, 24, 28, 35, 36]. Tan's method [35] obtains airlight, i.e. light scattered by particles, by maximizing local contrast. Fattal's method [10] decomposes the shading and transmission functions by assuming that they are statistically uncorrelated locally. He et al.'s method [17]

introduces a dark channel prior that exploits the minimum intensity of all color channels in a local window to indicate the level of haziness. The most recent work, Meng et al.'s [24], introduces a boundary constraint on the transmission function. Ancuti et al.'s method [2], instead of dealing with fog, addresses underwater vision by employing image fusion. While all these single-image defogging methods work to some extent, their major problem is the ambiguity between airlight and albedo. Nearby objects with low color saturation are taken for faraway objects with saturated color, since the methods mistakenly assume the surfaces are under heavy airlight. Aside from using a single image, a few methods have been introduced that utilize multiple images [25, 31, 38]. These methods require pixelwise registration of images with different particle data. Consequently, they cannot be applied if the scene is dynamic or if the camera moves.

Recently, several methods [7, 27, 30] have been introduced to solve stereo for foggy or underwater images. All of them are designed for a pair of stereo images, while our method processes a video sequence from a monocular moving camera. Caraffa and Tarel's method [7] combines the conventional photo-consistency term and the scattering equation to simultaneously solve stereo and defogging. However, as we discuss in Section 4.1, the conventional photo-consistency term becomes less effective in scattering media. Furthermore, the scattering equation is sensitive to the nonlinear camera response function and image noise. As a result, this method is only demonstrated on synthetic data. Nascimento et al.'s [27] and Roser et al.'s [30] methods iterate two steps: applying a conventional stereo algorithm to compute dense depth, then using the depth to invert the scattering equation and estimate the clear latent image. They further apply the matting Laplacian [21] as an edge-preserving filter to enhance the depth in the iteration. This straightforward combination suffers from two flaws. Firstly, the light scattering effect is not modeled in the photo-consistency measure for stereo matching, which causes erroneous stereo reconstruction. Secondly, different video frames are defogged independently, which generates inconsistent frames and leads to systematic errors in stereo reconstruction. Our experiments verify these problems. In comparison, our formulation presents a tight fusion of the depth cues from stereo and defogging and produces much stronger results.

Our work is also related to [26], which studies structured light based stereo in scattering media. In contrast, our work focuses on passive stereo in scattering media with a completely different formulation and setup.

3. Background

Before formulating our approach, we give an overview of the typical formulations for the defogging and MVS problems and introduce the notation used in the paper.


3.1. Fog model

In computer vision, a widely used scattering model is [9, 10, 17, 25, 35]:

\[
I(x) = J(x)\,\alpha(x) + A\,(1 - \alpha(x)), \tag{1}
\]

where I is the observed image in scattering media (e.g. fog, haze, or turbid water), J is the latent clear image unaffected by the media, A is the global atmospheric light, and α is the medium transmission, which determines the portion of the light that is not scattered and reaches the camera. When the atmosphere is homogeneous, the transmission α can be expressed as:

\[
\alpha(x) = e^{-\beta z(x)}, \tag{2}
\]

where β is the scattering coefficient, which depends on the density of the media, and z is the distance from the scene point to the camera center. To simplify the formulation, we assume that the scene point depth approximates z well, as in [7, 19].
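To make the model concrete, here is a minimal NumPy sketch (ours, not the authors' code) that renders Equations (1) and (2) for a clear image J, a depth map z, atmospheric light A, and scattering coefficient β; all array shapes and values are illustrative.

```python
import numpy as np

def synthesize_fog(J, z, A, beta):
    """Apply Eqs. (1)-(2): I = J * alpha + A * (1 - alpha),
    with per-pixel transmission alpha = exp(-beta * z)."""
    alpha = np.exp(-beta * z)                    # (H, W) transmission map
    return J * alpha[..., None] + A * (1.0 - alpha[..., None])

# Example: a uniform gray scene 10-50 m away under moderate fog.
J = np.full((4, 4, 3), 0.5)                      # latent clear image, RGB in [0, 1]
z = np.linspace(10.0, 50.0, 16).reshape(4, 4)    # depth map in meters
I = synthesize_fog(J, z, A=np.array([0.9, 0.9, 0.9]), beta=0.05)
```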

3.2. Stereo from monocular videos

Assume n continuous frames I = {It | t = 1, . . . , n} with known camera parameters C = {Kt, Rt, tt | t = 1, . . . , n}, where Kt is the intrinsic matrix, Rt the rotation matrix, and tt the translation vector. These camera parameters can be estimated by any standard structure-from-motion (SfM) method [18, 40]. We follow [43] in formulating the problem of video-based stereo reconstruction, which aims to estimate the inverse depth maps D = {Dt | t = 1, . . . , n} for all frames. That is, Dt(x) = 1/Zt(x), where Zt(x) is the depth of pixel x in frame t. To formulate the problem as a generic random field for dense image labeling, the continuous value of Dt is discretized into equal steps within some range [dmin, dmax]. The energy function then takes the following form:

\[
E(D) = \sum_{t=1}^{n} \big( E_p(D_t) + \eta E_g(D_t) + \rho E_s(D_t) \big), \tag{3}
\]

where Ep(Dt) is the photoconsistency term, Eg(Dt) is the geometric coherence term, Es(Dt) is the smoothness term, and η and ρ are parameters that balance these terms.

In order to define the photoconsistency term, we assume x is written in homogeneous coordinates and derive from multi-view geometry the following projection function:
\[
l_{i\to j}(x, D_i(x)) = K_j R_j R_i^T K_i^{-1} x + D_i(x)\, K_j (t_j - R_j R_i^T t_i), \tag{4}
\]
which projects the pixel x with inverse depth Di(x) in frame i to frame j.
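For illustration, a direct NumPy transcription of Equation (4) might look as follows; the function and parameter names are ours, and the camera convention is assumed to match the paper's (world-to-camera rotation R and translation t).

```python
import numpy as np

def project(x_h, d, Ki, Ri, ti, Kj, Rj, tj):
    """Eq. (4): map homogeneous pixel x_h with inverse depth d from frame i
    to frame j. Divide the returned 3-vector by its last entry to obtain
    pixel coordinates in frame j."""
    rel_rot = Rj @ Ri.T                          # relative rotation between frames
    return Kj @ rel_rot @ np.linalg.inv(Ki) @ x_h + d * (Kj @ (tj - rel_rot @ ti))
```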

Now we can write the photoconsistency term as
\[
E_p(D_t) = \frac{1}{|\mathcal{N}(t)|} \sum_{t' \in \mathcal{N}(t)} \sum_{x} \big\| I_t(x) - I_{t'}(l_{t\to t'}(x, D_t(x))) \big\|, \tag{5}
\]
where \(\mathcal{N}(t)\) denotes the neighboring frames of t and \(|\mathcal{N}(t)|\) is the number of frames in the neighboring set. As in many classic stereo algorithms, this term measures the photoconsistency of frame t and its neighboring frames.

The geometric coherence term is specifically designed for video-based stereo [43] to ensure temporal consistency of the recovered depth maps and to handle occlusions. It is still a unary cost, but it checks the inverse depths of the conjugate pixels in neighboring frames:

\[
E_g(D_t) = \frac{1}{|\mathcal{N}(t)|} \sum_{t' \in \mathcal{N}(t)} \sum_{x} \big\| x - l_{t'\to t}(x', D_{t'}(x')) \big\|, \tag{6}
\]
where \(x' = l_{t\to t'}(x, D_t(x))\) is the conjugate pixel location of x in frame t' and Dt' is the inverse depth map of frame t'. This term essentially enforces the geometric consistency of depth maps across frames, contributing to the temporal consistency of the final result.

Typically, the smoothness term is defined as:

\[
E_s(D_t) = \sum_{x} \sum_{y \in \mathcal{N}(x)} w(x, y) \cdot f(D_t(x), D_t(y)). \tag{7}
\]

For robustness, f(Dt(x), Dt(y)) is usually the truncated ℓ1 function:

\[
f_{\ell_1}(D_t(x), D_t(y)) = \min\{\|D_t(x) - D_t(y)\|, \tau_1\}, \tag{8}
\]

where τ1 is the truncation parameter and w(x, y) is a weight function indicating the probability that x and y should be assigned the same inverse depth. To encourage depth discontinuities to coincide with color changes, w(x, y) is usually defined based on the color difference of neighboring pixels [3, 4, 33].
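As a sketch, the truncated ℓ1 penalty and one plausible color-based weight could be implemented as below; note that the paper does not specify the exact form of w(x, y), so the Gaussian weight and sigma here are assumptions for illustration only.

```python
import numpy as np

def truncated_l1(d_x, d_y, tau1):
    """Eq. (8): truncated L1 penalty between neighboring inverse depths."""
    return min(abs(d_x - d_y), tau1)

def color_weight(I, x, y, sigma=10.0):
    """One common choice for w(x, y) in Eq. (7): neighboring pixels with
    similar colors are encouraged to share the same inverse depth.
    I is an HxWx3 image; x and y are (row, col) tuples."""
    diff = I[x].astype(np.float64) - I[y].astype(np.float64)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```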

Since it is difficult to achieve global optimality for Equation (3), video-based stereo usually adopts a two-step strategy. First, a depth initialization step solves Equation (3) frame by frame, omitting the geometric coherence term, to obtain initial depth maps. The estimated initial depth maps are then fed into the second step to solve the complete version of Equation (3). More specifically, it solves one depth map at a time while fixing the depth maps of the other frames, iterating several passes (typically 2), with one pass traversing all frames once. These problems are standard MRF minimization problems that admit efficient solutions such as graph cut [4] or loopy belief propagation (LBP) [11]. For more details, interested readers can refer to [43].

4. Simultaneous Defogging and Stereo

It is often difficult to measure photoconsistency in a foggy video, because the scene radiance is attenuated differently from different camera positions. To overcome this difficulty, we propose a more sophisticated photoconsistency term that takes the scattering effect into consideration. Meanwhile, the presence of fog also opens up the possibility of enriching the details of the reconstructed depth. For this purpose, we include the matting Laplacian constraint [21] as a detail-preserving smoothness term.


Figure 1. The new photoconsistency term: (a) A source frame from the "Bali" data with two heavily attenuated pixels. (b) The data cost at pixel 1. (c) The data cost at pixel 2. The green squares mark the true inverse depth (manually verified by projecting to other frames).

Lastly, we also leverage the ordering constraint provided by the transmission when imposing smoothness on the inverse depths. Consequently, our new energy function takes the following form:
\[
E(D) = \sum_{t=1}^{n} \big( E_{ps}(D_t) + \eta E_g(D_t) + \rho E_{so}(D_t) + \lambda E_{Lap}(D_t) \big), \tag{9}
\]
where Eg(Dt) is the unchanged geometric coherence term, and Eps(Dt), Eso(Dt) and ELap(Dt) correspond to the new photoconsistency term, the new smoothness term augmented with an ordering constraint, and the matting Laplacian term, respectively. Jointly optimizing these terms brings mutual benefits to both stereo depth recovery and defogging. In particular, the ambiguity of faraway objects is lifted, and both depth and color details are better recovered. In the following subsections, we explain the new energy terms in detail.

4.1. Photoconsistency

For now, let us assume that the scattering coefficient β and atmospheric light A are known. We will present an automatic method to estimate them in Section 4.5. Denote \(R_i = [r_{i,1}\; r_{i,2}\; r_{i,3}]^T\) and \(t_i = [t_{i,1}\; t_{i,2}\; t_{i,3}]^T\), where \(\{r_{i,k}^T \mid k = 1, 2, 3\}\) are the rows of Ri and \(\{t_{i,k} \mid k = 1, 2, 3\}\) are the entries of ti. Now we can derive the projection function πi→j(x, αi(x)), which computes the corresponding transmission value in the j-th frame for the pixel x in the i-th frame with transmission αi(x). Specifically,
\[
\pi_{i\to j}(x, \alpha_i(x)) = \exp\!\big( r_{i,j}^T K_i^{-1} x \, \log(\alpha_i(x)) + \beta\, (r_{i,j}^T t_i - t_{j,3}) \big), \tag{10}
\]
where \(r_{i,j}^T = [r_{j,3}^T r_{i,1}\;\; r_{j,3}^T r_{i,2}\;\; r_{j,3}^T r_{i,3}]\) is the last row of \(R_j R_i^T\). More intuitively, Equation (10) can be interpreted as follows. Knowing αi(x) means knowing Di(x), because they can be converted to each other by Equation (2). Thus, together with the camera parameters, the 3D world coordinate Xworld of pixel x can be calculated. With the world coordinate Xworld, we can calculate its depth in the j-th camera, and hence the transmission πi→j(x, αi(x)). Note that we use depth Zi(x) instead of distance zi(x) in Equation (2) to simplify Equation (10), though in principle the distance zi(x) could be used too.
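The transmission warp of Equation (10) reduces to a few matrix products; the following NumPy sketch (ours, with our parameter names) assumes the same camera convention as Equation (4).

```python
import numpy as np

def warp_transmission(x_h, alpha_i, beta, Ki, Ri, ti, Rj, tj):
    """Eq. (10): predict the transmission at the conjugate location in
    frame j from pixel x_h (homogeneous) with transmission alpha_i in
    frame i. r_ij is the last row of the relative rotation Rj @ Ri.T."""
    r_ij = (Rj @ Ri.T)[2]
    return np.exp(r_ij @ np.linalg.inv(Ki) @ x_h * np.log(alpha_i)
                  + beta * (r_ij @ ti - tj[2]))
```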

Now we can define the new photoconsistency term, corrected for the scattering effect:
\[
E_{ps}(D_t) = \frac{1}{|\mathcal{N}(t)|} \sum_{t' \in \mathcal{N}(t)} \sum_{x} \big\| \tilde{I}_{t'}(x) - I_{t'}(l_{t\to t'}(x, D_t(x))) \big\|, \tag{11}
\]
where
\[
\tilde{I}_{t'}(x) = (I_t(x) - A)\,\frac{\pi_{t\to t'}(x, \alpha_t(x))}{\alpha_t(x)} + A,
\]
and computing it can be interpreted as synthesizing the attenuated appearance of pixel x in frame t' given the transmission αt(x). Note that αt(x) is related to Dt(x) through Equation (2), so Dt is the only unknown in Equation (11). For the same reason, the following functions defined on αt will also be considered functions of Dt unless specifically pointed out.
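Putting Equations (10) and (11) together, the per-pixel data cost can be sketched as follows; the names are ours, and the warped observation I_tp_warp is assumed to have been sampled beforehand at the location given by Equation (4).

```python
import numpy as np

def scattering_corrected_cost(I_t_x, I_tp_warp, alpha_t_x, pi_warp, A):
    """Per-pixel cost of Eq. (11): synthesize how the foggy color I_t_x at x
    in frame t should look in frame t' (using the warped transmission pi_warp
    from Eq. (10)), then compare against the observed color I_tp_warp at the
    projected location."""
    I_synth = (I_t_x - A) * (pi_warp / alpha_t_x) + A
    return np.linalg.norm(I_synth - I_tp_warp)
```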

Figure 1 (b) and (c) show the values of our improved data term at the two points marked in Figure 1 (a). Since these faraway points are highly attenuated and thus suffer from low image contrast, the conventional data term does not work and tends to assign incorrect depth values to these points. In comparison, the new photoconsistency cost shows a clear minimum at the true inverse depth (marked by a green square in Figure 1 (b) and (c)).

4.2. Laplacian smoothing

A fog transmission map should satisfy the Laplacian smoothness prior [17, 21]. We find that this prior not only refines the transmission map, but also helps to preserve details in the depth map, probably due to its close relation to spectral image segmentation.

The Laplacian term is defined as
\[
E_{Lap}(D_t) = \mathrm{vec}(\alpha_t)^T L_t\, \mathrm{vec}(\alpha_t), \tag{12}
\]
where vec(αt) converts αt into vector form, and Lt is the Laplacian matrix with its (i, j)-th entry defined as [21]
\[
L_t(i, j) = \sum_{k \mid (i,j) \in w_k} \left( \delta_{ij} - \frac{1}{|w_k|} \left( 1 + (I_t(x_i) - \mu_k)^T \left( \Sigma_k + \frac{\varepsilon}{|w_k|} I_3 \right)^{-1} (I_t(x_j) - \mu_k) \right) \right), \tag{13}
\]
where δij is the Kronecker delta, μk is the 3 × 1 mean vector of the colors in a window wk, Σk is the 3 × 3 covariance matrix of the colors in wk, ε is a regularization parameter, and I3 is the 3 × 3 identity matrix. Equation (13) sums over all 3 × 3 windows wk in which the i-th and j-th pixels both appear.
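For reference, a compact (unoptimized) construction of the matting Laplacian in Equation (13) with SciPy sparse matrices might look like this; it follows Levin et al. [21] but is our sketch, not the authors' implementation.

```python
import numpy as np
from scipy import sparse

def matting_laplacian(I, eps=1e-7, r=1):
    """Sparse matting Laplacian of Eq. (13), built from all (2r+1)x(2r+1)
    windows (r=1 gives the 3x3 windows used in the paper). I is an HxWx3
    float image; duplicate COO entries are summed, realizing the sum over
    all windows containing each pixel pair."""
    H, W, _ = I.shape
    win = (2 * r + 1) ** 2
    idx = np.arange(H * W).reshape(H, W)
    rows, cols, vals = [], [], []
    for y in range(r, H - r):
        for x in range(r, W - r):
            w_idx = idx[y - r:y + r + 1, x - r:x + r + 1].ravel()
            w_I = I[y - r:y + r + 1, x - r:x + r + 1].reshape(win, 3)
            mu = w_I.mean(axis=0)
            cov = (w_I - mu).T @ (w_I - mu) / win
            inv = np.linalg.inv(cov + (eps / win) * np.eye(3))
            # G[i, j] = delta_ij - (1 + (I_i - mu)^T inv (I_j - mu)) / |w_k|
            G = np.eye(win) - (1.0 + (w_I - mu) @ inv @ (w_I - mu).T) / win
            rows.append(np.repeat(w_idx, win))
            cols.append(np.tile(w_idx, win))
            vals.append(G.ravel())
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(H * W, H * W)).tocsr()
```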

We demonstrate the effectiveness of this Laplacian smoothing in Figure 2.


Figure 2. The Laplacian term: (a) and (b) are the inverse depth maps (of Figure 1 (a)) produced without and with the Laplacian term, respectively.

When the Laplacian term is not enabled, shape details on the nearby pole are missing, mostly because of the shrinking bias caused by the belief-propagation based optimization. In comparison, our Laplacian term captures these details.

4.3. Ordering constraint based on transmission

The fog transmission conveys a more reliable constraint on the depth order between points than on their absolute depth values. We further leverage this aspect of the fog information. More specifically, assume x and y are two neighboring pixels. If αt(x) > αt(y), we expect Dt(x) ≥ Dt(y). In theory, this always holds in Equation (9). However, the ordering constraint may be violated because we adopt an alternating optimization method (Section 4.4), which decouples αt(x) and Dt(x). Thus, when this condition is violated, we assign a large penalty τ2. Mathematically, we modify f(Dt(x), Dt(y)) in Equation (7) as:

\[
f_o(D_t(x), D_t(y)) =
\begin{cases}
\tau_2, & \delta(\alpha_t(x) - \alpha_t(y)) \cdot \delta(D_t(x) - D_t(y)) = -1, \\
f_{\ell_1}(D_t(x), D_t(y)), & \text{otherwise},
\end{cases} \tag{14}
\]
where δ(·) is the sign function, returning 1 for positive values, −1 for negative values, and 0 for zero. Replacing fℓ1 with fo in Equation (7), we denote the resulting smoothness term with ordering constraint as Eso(Dt). Note that the weight function w(x, y) remains unchanged. It has been shown that when the transmission αt is known, this smoothness function remains a metric [8] and is thus solvable by α-expansion.

Figure 3 shows the advantage of enforcing the ordering constraint. In Figure 3 (b), the inverse depth of the sky in the red box is wrong, while it is corrected in Figure 3 (c) by our ordering constraint. These results are produced by solving Equation (9), where the only difference is whether the smoothness term Eso(Dt) is used.

4.4. Solver

Following [43], we adopt a two-step optimization strategy. We initialize the depth maps by ignoring the geometric coherence term in the first step, and then solve the complete version of Equation (9) iteratively in the second step. Equation (9) is not easy to solve because of the Laplacian term.

Figure 3. The ordering constraint: (a) A source frame from the "Blenheim" data. (b) and (c) are the inverse depths without and with the ordering constraint, respectively.

Thus, we adopt an alternating optimization strategy with half-quadratic splitting [14], based on the idea of introducing an auxiliary variable to decouple the terms and updating them alternately. This strategy is widely used in computer vision algorithms, particularly in total variation (TV) regularization [39]. Although it was originally developed for convex problems and no general convergence theory applies to non-convex problems, it has been applied to several non-convex problems [20, 41] and shown to perform well. Indeed, we empirically found that our algorithm also has strong convergence behaviour.

We split the function into two parts: one is a function of inverse depth, which can be minimized by MRF energy minimization; the other is a function of transmission, which is convex and has a closed-form solution. We further introduce a coupling term to enforce the consistency of Dt and αt. Thus the energy function is rewritten as

\[
E(D) = \sum_{t=1}^{n} \big( E_{ps}(D_t) + \eta E_g(D_t) + \rho E_{so}(D_t) + \epsilon \| e^{-\beta D_t} - \alpha_t \|_F^2 + \lambda E_{Lap}(\alpha_t) \big). \tag{15}
\]

We minimize the new objective function iteratively until convergence or until the number of iterations exceeds a maximum limit. In each iteration, we solve for Dt while fixing αt, and then solve for αt while fixing Dt. The two subproblems are thus

\[
\min \; \sum_{t=1}^{n} \big( E_{ps}(D_t) + \eta E_g(D_t) + \rho E_{so}(D_t) + \epsilon \| e^{-\beta D_t} - \alpha_t \|_F^2 \big), \tag{16}
\]
\[
\min \; \sum_{t=1}^{n} \big( \epsilon \| e^{-\beta D_t} - \alpha_t \|_F^2 + \lambda E_{Lap}(\alpha_t) \big). \tag{17}
\]

Since the last term in Equation (16) is unary, and Eso(Dt) with the ordering constraint introduces no additional difficulty, Equation (16) can be solved by graph cut [4] or LBP [11]. The second subproblem, Equation (17), is an unconstrained convex problem with the closed-form solution:
\[
\mathrm{vec}(\alpha_t^*) = \left( I + \frac{\lambda}{\epsilon} L_t \right)^{-1} u_t, \tag{18}
\]
where \(u_t = \mathrm{vec}(e^{-\beta D_t})\) and I is the identity matrix.
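In practice, Equation (18) should be evaluated as a sparse linear solve rather than an explicit matrix inverse; a sketch (ours), assuming L_t comes from the matting Laplacian construction above:

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def update_transmission(D_t, L_t, beta, lam, eps):
    """Eq. (18): closed-form transmission update. Solving the sparse system
    (I + lam/eps * L_t) vec(alpha) = vec(exp(-beta * D_t)) is far cheaper
    than forming the dense inverse. D_t is the current inverse depth map."""
    u = np.exp(-beta * D_t).ravel()              # u_t = vec(e^{-beta D_t})
    n = u.size
    alpha = spsolve(identity(n, format='csr') + (lam / eps) * L_t, u)
    return alpha.reshape(D_t.shape)
```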



Figure 4. The histograms of β for: (a) Bali and (b) Synthetic.

4.5. Estimating A and β

Tan [35] uses the brightest pixel value as the atmospheric light A. The estimation of A was later refined in [10, 17]. We follow the method in [17] to estimate A because of its robustness.

We then estimate β, given A and a set of sparsely matched pixel pairs {(xk, yk) | k = 1, . . . , K} in the i-th and j-th frames. These points are readily collected from the structure-from-motion step, where feature correspondences are established to recover camera motion. According to Equation (1), they satisfy

\[
\begin{aligned}
I_i(x_k) - A &= (J_i(x_k) - A)\,\alpha_i(x_k), \\
I_j(y_k) - A &= (J_j(y_k) - A)\,\alpha_j(y_k).
\end{aligned} \tag{19}
\]

Since xk and yk correspond to the same scene point, we know Ji(xk) − A = Jj(yk) − A by photo-consistency. Then we have

\[
\frac{I_i(x_k) - A}{I_j(y_k) - A} = \frac{\alpha_i(x_k)}{\alpha_j(y_k)} = \frac{\exp(-\beta Z_i(x_k))}{\exp(-\beta Z_j(y_k))} = \exp\!\big( -\beta\, (Z_i(x_k) - Z_j(y_k)) \big), \tag{20}
\]

where Zi(xk) is the depth (or distance), readily known from the structure-from-motion step. Note that each pixel pair gives an estimate of β from Equation (20). For stability, we disregard pairs where the inverse depth difference is smaller than some threshold (typically 10⁻³). We then build a histogram from the collection of β estimates and choose the value of the highest bin as β. Several example histograms are shown in Figure 4, from which we can see that the estimates are nicely distributed.
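The voting scheme of Equations (19) and (20) can be sketched as follows; averaging the color channels to form the ratio is our simplification, while the threshold on the inverse depth difference follows the paper.

```python
import numpy as np

def estimate_beta(Ii, Ij, A, Zi, Zj, min_diff=1e-3, bins=50):
    """Vote for beta via Eq. (20). Ii, Ij are (K, 3) colors of matched pixel
    pairs, Zi, Zj their depths from SfM, and A the atmospheric light. Pairs
    whose inverse depths differ by less than min_diff are discarded, then
    the mode of the per-pair estimates is taken."""
    ratio = np.mean(Ii - A, axis=1) / np.mean(Ij - A, axis=1)
    keep = (np.abs(1.0 / Zi - 1.0 / Zj) > min_diff) & (ratio > 0)
    betas = -np.log(ratio[keep]) / (Zi[keep] - Zj[keep])
    hist, edges = np.histogram(betas, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])       # center of the highest bin
```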

5. Experimental Results

In our experiments, the camera parameters for all frames in all data are estimated beforehand. Because of the scattering effect, we use the SIFT detector and descriptor [22] to match feature points in the videos. We find that most features are extracted from near objects, which are less degraded by the scattering media, so most matched features are reliable. We then use the SfM method proposed in [18] to recover camera poses. With the estimated camera poses, we first conduct a depth initialization step by solving Equation (9) without the geometric coherence term, and then solve the complete version of Equation (9) in the next step. Equation (9) (with or without the geometric coherence term) is solved by iteratively solving Equation (16) and Equation (17).

During the depth initialization step, we do not yet have the fog transmission map at our disposal, so the result of the dark channel method [17] is used as an initialization for α to apply the ordering constraint. In subsequent iterations, the fog transmission map is kept updated from the estimated depth map via Equation (17).

We evaluate our method on several challenging videos. The videos were captured at different localities on foggy days. Among them, the "Blenheim" data was downloaded from Vimeo. We also captured a turbid underwater video in a tank using an underwater camera; to simulate the scattering medium, we poured milk into the water. Complete video results can be found in the supplementary material. All experiments are run on a desktop with an Intel quad-core 2.4 GHz CPU. The time taken to process a frame at 480 × 270 resolution is about 10 minutes.

For the stereo evaluation, we compare with the state-of-the-art video-based stereo reconstruction [43]. For the defogging evaluation, we compare with several state-of-the-art single-image defogging methods: the dark channel method by He et al. [17], the recent algorithm by Meng et al. [24], and the improved version of [36], denoted NBPC+PA [37]. For a fairer comparison, we also evaluate approaches that adopt both fog and stereo cues; these include our own implementation of Caraffa and Tarel (referred to as CT) [7] and two straightforward combinations of defogging and stereo (similar to [27, 30]). The straightforward combinations simply iterate between defogging (by Meng et al. [24]) and stereo (by Zhang et al. [43]) with different initializations. We denote the one starting with the stereo result as SD (stereo then defog) and the other as DS (defog then stereo). The iteration stops when there are no big changes in the results (usually 5 iterations).

We first verify our algorithm on synthetic data. The evaluation metric is the sum of absolute differences between the recovered depth (or defogged image) and the ground truth. The quantitative comparisons are shown in Table 1, from which it is observed that our approach outperforms the others with the smallest errors. In the following, we show qualitative comparisons and analyze the strong aspects of our method.

Figure 5. Comparison with a conventional stereo method. (a) Ground-truth inverse depth. (b) Result from Zhang et al. [43]. (c) Our result. (d) Source image. (e) and (f) are the error maps of Zhang et al. [43] and ours, respectively.


Figure 6. Comparison with conventional defogging methods. (a) Clear image. (b) Result from Meng et al. [24]. (c) Our result. (d) Source image. (e) and (f) are the error maps of Meng et al. [24] and ours, respectively.

Figure 5 shows the evaluation of stereo estimation on the synthetic data. The estimated inverse depth maps are normalized to [0, 1] for comparison. As can be observed, our result is considerably close to the ground truth for both near and far regions, while Zhang et al. [43] cannot distinguish the depths of the ground and walls very well due to the low contrast. Thus, these regions are wrongly assigned close inverse depths.

For the defogging evaluation on the synthetic data, Figure 6 shows that our method produces the result closest to the ground truth according to the error map. In contrast, Meng et al. [24] (the results of the other two single-image defogging methods are similar) shows significant errors in the ground region, which is light-toned in color and is thus mistaken (due to the respective priors employed) for faraway objects densely covered with strong fog. The fog in the distant region in the middle of the image is also not completely removed by these single-image methods. Furthermore, single-image defogging methods generate temporal flickering on videos, since each frame is processed independently. Please refer to the supplementary video.

We then compare our algorithm with the others on real-world data, as shown in Figure 7. Comparing with the stereo algorithm [43] alone (second column, every even row), it is observed from Figure 7 that [43] is not able to recover the depths of distant objects that are highly degraded by the scattering medium, such as the house in "Bali", the distant trees in "Blenheim", and the castle in "Underwater". In contrast, our method recovers the structures of the house, trees, and castle more correctly. Moreover, [43] generates artifacts in these data, such as the plant region in "Bali", the sky region in "Blenheim", the door region in "Motorcycle", and the region near the chest in "Underwater". In comparison, our method is free of such artifacts. Our method preserves the details of the scene, such as the elongated steel frame embedded in the foreground concrete column in "Bali", the more complicated geometric details of trees and bushes in "Blenheim", and the furrows on the treasure chest in "Underwater".

Comparing with the defogging algorithm [24] alone (second column, every odd row), it is observed from Figure 7 that

Table 1. Comparison on the synthetic data. The error (per frame) of the stereo and defogging results is presented.

            [17]    [24]    [37]    [43]    CT [7]  DS      SD      Ours
Stereo      N/A     N/A     N/A     76.42   25.44   23.10   50.63   8.59
Defogging   29.42   31.68   33.98   N/A     11.92   8.32    14.36   6.12

our method is able to recover more faithful colors, particularly in the light-toned ground regions of the "Motorcycle" and "Playground" sequences, which again cause problems for single-image defogging. Moreover, the fog in front of the right door in "Motorcycle" is not completely removed by [24], and the recovered scene in "Underwater" is darker. In contrast, our method handles these regions well thanks to the high-quality depth estimated by our stereo algorithm.

Comparing with the two straightforward combinations of defogging and stereo, it is observed from Figure 7 that separate defogging and stereo do not necessarily improve each other much. For DS, though the initial defogged images look visually plausible (shown in the second column of Figure 7), they are actually inconsistent across frames and lead to poor stereo results. For instance, the stereo results for the region near the pillar in "Bali" and the sky in "Blenheim" become much worse. The SD approach suffers from similar problems: its stereo results on "Bali" and "Blenheim" become worse after iteration due to similar inconsistency problems. Both combinations also fail to recover the depths of distant objects and shape details.

Comparing with CT [7], it is observed from Figure 7 that our method outperforms theirs in both defogging and stereo. Their method still employs the conventional photo-consistency term as a stereo initialization, which becomes less effective in scattering media. Their simple addition of the stereo and defogging terms cannot reconstruct distant objects and depth details of the scene. Their defogging results are also worse compared to the others, since directly inverting the scattering equation is sensitive to radiometric calibration.

6. Conclusion

We formulate simultaneous video defogging and stereo reconstruction as a unified energy minimization problem. Our improved photoconsistency term explicitly models the scattering effect and makes stereo matching more robust. The matting Laplacian constraint helps to preserve shape details, especially thin and elongated structures, which are well-known challenges for conventional stereo algorithms. We further enforce the relative depth order at neighboring pixels to be consistent with their relative fog thickness. As a result, our stereo method estimates high-quality depth for scenes in scattering media, and produces temporally consistent video with enhanced visibility. Experiments on synthetic and real data demonstrate the superior performance of our method over recent algorithms on both defogging and multi-view stereo.


Figure 7. Comparisons of defogging and stereo results on real-world videos ("Bali", "Blenheim", "Motorcycle", "Stele", "Playground", and "Underwater"). The first column shows the source images and the corresponding names of the data. Every odd row shows the defogging results and every even row shows the stereo results. The remaining columns correspond, in order, to Meng et al. [24]/Zhang et al. [43], DS, SD, CT [7], and ours; note that the second column shows the defogging result of Meng et al. [24] and the stereo result of Zhang et al. [43].


Acknowledgements. This work was partially supported by grant R-263-000-A87-720 from Adobe Systems and Singapore PSF grant 1321202075. Ping Tan is sponsored by the Canada NSERC Discovery (No. 611664) and Discovery Acceleration Supplement (No. 611663) programs.

References
[1] http://vision.middlebury.edu/mview/.
[2] C. Ancuti, C. O. Ancuti, T. Haber, and P. Bekaert. Enhancing underwater images and videos by fusion. In Proc. CVPR, 2012.
[3] A. F. Bobick and S. S. Intille. Large occlusion stereo. International Journal of Computer Vision, 33(3):181–200, 1999.
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11):1222–1239, 2001.
[5] D. Bradley, T. Boubekeur, and W. Heidrich. Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In Proc. CVPR, 2008.
[6] N. D. F. Campbell, G. Vogiatzis, C. Hernandez, and R. Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In Proc. ECCV, 2008.
[7] L. Caraffa and J.-P. Tarel. Stereo reconstruction and contrast restoration in daytime fog. In Proc. ACCV, 2012.
[8] P. Carr and R. Hartley. Improved single image dehazing using geometry. In Proc. DICTA, pages 103–110, 2009.
[9] F. G. Cozman and E. Krotkov. Depth from scattering. In Proc. CVPR, 1997.
[10] R. Fattal. Single image dehazing. ACM Trans. Graph. (Proc. of SIGGRAPH), 27(3), 2008.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. International Journal of Computer Vision, 70(1):41–54, 2006.
[12] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1362–1376, 2009.
[13] D. Gallup, J.-M. Frahm, and M. Pollefeys. Piecewise planar and non-planar stereo for urban scene reconstruction. In Proc. CVPR, 2010.
[14] D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Mach. Intell., 14(3):367–383, 1992.
[15] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. M. Seitz. Multi-view stereo for community photo collections. In Proc. ICCV, 2007.
[16] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[17] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell., 33(12):2341–2353, 2011.
[18] N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In Proc. ICCV, 2013.
[19] J. Kopf, B. Neubert, B. Chen, M. Cohen, D. Cohen-Or, O. Deussen, M. Uyttendaele, and D. Lischinski. Deep photo: Model-based photograph enhancement and viewing. ACM Trans. Graph. (Proc. of SIGGRAPH Asia), 27(5), 2008.
[20] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In Proc. NIPS, 2009.
[21] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):228–242, 2008.
[22] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[23] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid. A constant-time efficient stereo SLAM system. In Proc. BMVC, 2009.
[24] G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan. Efficient image dehazing with boundary constraint and contextual regularization. In Proc. ICCV, 2013.
[25] S. G. Narasimhan and S. K. Nayar. Contrast restoration of weather degraded images. IEEE Trans. Pattern Anal. Mach. Intell., 25(6):713–724, 2003.
[26] S. G. Narasimhan, S. K. Nayar, B. Sun, and S. J. Koppal. Structured light in scattering media. In Proc. ICCV, pages 420–427, 2005.
[27] E. Nascimento, M. Campos, and W. Barros. Stereo based structure recovery of underwater scenes from automatically restored images. In XXII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI), 2009.
[28] K. Nishino, L. Kratz, and S. Lombardi. Bayesian defogging. International Journal of Computer Vision, 98(3):263–278, 2012.
[29] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. CVPR, 2004.
[30] M. Roser, M. Dunbabin, and A. Geiger. Simultaneous underwater visibility assessment, enhancement and improved stereo. In Proc. ICRA, 2014.
[31] Y. Y. Schechner, S. G. Narasimhan, and S. K. Nayar. Instant dehazing of images using polarization. In Proc. CVPR, 2001.
[32] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. CVPR, 2006.
[33] C. Strecha, R. Fransens, and L. J. V. Gool. Wide-baseline stereo from multiple views: A probabilistic account. In Proc. CVPR.
[34] C. Strecha, R. Fransens, and L. Van Gool. Combined depth and outlier estimation in multi-view stereo. In Proc. CVPR, 2006.
[35] R. Tan. Visibility in bad weather from a single image. In Proc. CVPR, 2008.
[36] J.-P. Tarel and N. Hautiere. Fast visibility restoration from a single color or gray level image. In Proc. CVPR, 2009.
[37] J.-P. Tarel, N. Hautiere, A. Cord, D. Gruyer, and H. Halmaoui. Improved visibility of road scene images under heterogeneous fog. In IEEE Intelligent Vehicles Symposium (IV), 2010.
[38] T. Treibitz and Y. Y. Schechner. Active polarization descattering. IEEE Trans. Pattern Anal. Mach. Intell., 31(3):385–399, 2009.
[39] Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sciences, 1(3):248–272, 2008.
[40] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz. Multicore bundle adjustment. In Proc. CVPR, 2011.
[41] L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0 gradient minimization. ACM Trans. Graph., 30(6):174, 2011.
[42] C. Zhang, Z. Li, R. Cai, H. Chao, and Y. Rui. As-rigid-as-possible stereo under second order smoothness priors. In Proc. ECCV, 2014.
[43] G. Zhang, J. Jia, T.-T. Wong, and H. Bao. Consistent depth maps recovery from a video sequence. IEEE Trans. Pattern Anal. Mach. Intell., 31(6):974–988, 2009.