


Supplementary Materials for Learning to Parse Wireframes in Images of Man-Made Environments

Kun Huang¹, Yifan Wang¹, Zihan Zhou², Tianjiao Ding¹, Shenghua Gao¹, and Yi Ma³
¹ShanghaiTech University {huangkun, wangyf, dingtj, gaoshh}@shanghaitech.edu.cn
²The Pennsylvania State University [email protected]
³University of California, Berkeley [email protected]

1. Wireframe Construction Algorithm Detail

Given an image, our wireframe construction algorithm takes a set of junctions $\{\mathbf{p}_i\}_{i=1}^N$, $\mathbf{p}_i = (\mathbf{x}_i, \{\theta_{ik}\}_{k=1}^{K_i})$, and a line heat map $h$ as input. Note that for the junctions and their branches predicted by our network, we only keep those with confidence scores higher than certain thresholds $\tau_c$ and $\tau_b$, respectively. As a pre-processing step, we further adopt a strategy similar to non-maximum suppression to remove duplicate detections.
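For concreteness, the pre-processing step can be sketched as follows. This is a minimal illustration rather than our exact implementation; the suppression radius (here 8 pixels) and the function and variable names are placeholders.

    import numpy as np

    def keep_confident_junctions(positions, scores, tau_c=0.5, radius=8.0):
        # Keep junctions with confidence >= tau_c, visiting them in
        # decreasing order of score, and greedily drop any junction that
        # falls within `radius` pixels of an already-kept one.
        order = np.argsort(-scores)
        kept = []
        for idx in order:
            if scores[idx] < tau_c:
                break  # scores are sorted; the rest are below threshold
            if all(np.linalg.norm(positions[idx] - positions[j]) > radius
                   for j in kept):
                kept.append(idx)
        return kept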

Our wireframe construction algorithm is presented in Alg. 1. In the algorithm, we first apply a threshold $\omega$ to convert the line heat map $h(\mathbf{p})$ into a binary map $\mathcal{M}$ (line 2). Note that this threshold $\omega$ is varied to obtain the precision-recall curve in our experiments on wireframe construction. The algorithm then proceeds as follows:

First, we connect all pairs of junctions which are aligned with each other's branch directions (lines 3-22). Let $r_{ik}$ represent the ray starting at $\mathbf{p}_i$ along its $k$-th branch. We collect all possible rays as $\mathcal{R} = \{r_{11}, \ldots, r_{1K_1}, \ldots, r_{i1}, \ldots, r_{iK_i}, \ldots\}$, and use $(i, k) = \pi(t)$ to map the $t$-th ray in $\mathcal{R}$ to its junction index $i$ and branch index $k$. Then, for the rays in $\mathcal{R}$, we use $V \in \mathbb{R}^{N_r \times N_r}$, $N_r = |\mathcal{R}|$, to record the indices of the corresponding ray/branch of the closest opposite junction. Specifically, $\forall t_1 \in \{1, \ldots, N_r\}$, we set $V(t_1, t_2)$ to 1 if and only if (i) $\mathbf{p}_i$ is on the ray $r_{jk_2}$ and $\mathbf{p}_j$ is on the ray $r_{ik_1}$, where $(i, k_1) = \pi(t_1)$, $(j, k_2) = \pi(t_2)$, and (ii) the distance between $\mathbf{p}_i$ and $\mathbf{p}_j$ is the shortest among all such aligned pairs (lines 5-15). Then, we consider two rays matched if $V(t_1, t_2) = V(t_2, t_1) = 1$ and add the corresponding junctions and line segments to $\mathcal{P}$ and $\mathcal{L}$, respectively (lines 17-21).
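The mutual-match test can be expressed compactly in code. The sketch below is an illustrative reimplementation of lines 3-22 of Alg. 1, not our released code; point_on_ray is a hypothetical predicate that checks whether a junction lies (approximately) on a given ray.

    import numpy as np

    def match_rays(rays, positions, point_on_ray):
        # rays: list of (junction_index, branch_index) pairs, i.e., pi(t).
        # positions: (N, 2) array of junction coordinates x_i.
        # point_on_ray(t, j): True if junction j lies on the t-th ray.
        n_r = len(rays)
        V = np.zeros((n_r, n_r), dtype=int)
        for t1, (i, _) in enumerate(rays):
            d_min, z = np.inf, -1
            for t2, (j, _) in enumerate(rays):
                if j != i and point_on_ray(t1, j) and point_on_ray(t2, i):
                    d = np.linalg.norm(positions[i] - positions[j])
                    if d < d_min:
                        d_min, z = d, t2
            if z >= 0:
                V[t1, z] = 1  # closest aligned opposite junction for ray t1
        # A line segment is kept only when the choice is mutual.
        return [(rays[t1][0], rays[t2][0])
                for t1 in range(n_r) for t2 in range(t1 + 1, n_r)
                if V[t1, t2] == 1 and V[t2, t1] == 1]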

Second, for any ray $r_{ik}$ which fails to find a matching ray using the above procedure, we attempt to recover additional line segments using the line support $\mathcal{M}$ (lines 23-38). We consider the following cases:

(a) If the distance between $\mathbf{p}_i$ and $\mathbf{q}_b$, the intersection of $r_{ik}$ and the image boundary, is smaller than a certain threshold (say $0.05 \times m$, where $m$ is the maximum of image width and height), we add $\{\mathbf{p}_i, \mathbf{q}_b\}$ and the connecting line segment to $\mathcal{P}$ and $\mathcal{L}$, respectively (lines 24-26).

(b) For a ray exceeding the length threshold in (a), we first find the farthest line pixel $\mathbf{q}_M$ along the ray on $\mathcal{M}$. Then, we find all the intersection points $\{\mathbf{q}_1, \ldots, \mathbf{q}_S\}$ of the line segment $(\mathbf{p}_i, \mathbf{q}_M)$ with existing segments in $\mathcal{L}$ (lines 28-29). Letting $\mathbf{q}_0 = \mathbf{p}_i$ and $\mathbf{q}_{S+1} = \mathbf{q}_M$, we calculate the line support ratio $\kappa(\mathbf{q}_{s-1}, \mathbf{q}_s)$, $s \in \{1, \ldots, S, S+1\}$, for each sub-segment. Here, $\kappa$ is defined as the ratio of line pixels (a pixel $\mathbf{p}$ is a line pixel if $\mathcal{M}(\mathbf{p}) = 1$) to the total length of the segment. If $\kappa$ is above a threshold, say 0.6, we add the segment to $\mathcal{L}$ and its endpoints to $\mathcal{P}$ (lines 30-36). A code sketch covering both cases follows below.
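The sketch below illustrates both recovery cases under stated assumptions: points are NumPy arrays, the binary map M is indexed as M[y, x], and the segment is sampled at roughly one point per pixel of length (our rasterization details may differ).

    import numpy as np

    def line_support_ratio(M, q_a, q_b):
        # kappa: fraction of points sampled along (q_a, q_b) that land on
        # line pixels of the binary map M (M[y, x] == 1).
        q_a, q_b = np.asarray(q_a, float), np.asarray(q_b, float)
        n = max(int(np.ceil(np.linalg.norm(q_b - q_a))), 1)
        ts = np.linspace(0.0, 1.0, n + 1)
        pts = q_a + ts[:, None] * (q_b - q_a)
        xs = np.clip(pts[:, 0].round().astype(int), 0, M.shape[1] - 1)
        ys = np.clip(pts[:, 1].round().astype(int), 0, M.shape[0] - 1)
        return M[ys, xs].mean()

    def recover_segments(M, p_i, q_b, q_M, crossings, m, kappa_thr=0.6):
        # Case (a): the ray meets the image boundary near the junction.
        if np.linalg.norm(np.asarray(p_i, float) - q_b) <= 0.05 * m:
            return [(p_i, q_b)]
        # Case (b): split (p_i, q_M) at the crossings q_1..q_S and keep
        # only the sub-segments with sufficient line support.
        pts = [p_i] + list(crossings) + [q_M]
        return [(a, b) for a, b in zip(pts[:-1], pts[1:])
                if line_support_ratio(M, a, b) > kappa_thr]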

2. Additional Experiments

2.1. Experiment on Junction Detection Network Parameters

In this section, we examine the choices of two important hyper-parameters in our junction detection network.

Effect of balancing positive and negative samples. In this experiment, we vary the value $r_{\max}$, which controls the maximum ratio between negative and positive samples at each iteration. Note that setting $r_{\max} = \infty$ is equivalent to using all grid cells during training. We can observe in Figure 1(a) that the precision-recall curves largely overlap, but as $r_{\max}$ increases, the curve shifts toward the high-precision, low-recall regime, and vice versa. For example, when $r_{\max} = 1$, the precision and recall at $\tau = 0.5$ are 0.19 and 0.94, respectively; when $r_{\max} = \infty$, they are 0.70 and 0.44, respectively. This has an important implication in practice, as human annotators tend to miss true junctions much more often than labelling wrong junctions. Empirically, we have found that $r_{\max} = 7$ yields more satisfactory results.
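As an illustration, the balancing scheme we assume here keeps every positive grid cell and randomly subsamples the negative cells so that their count is at most r_max times the number of positives; the exact sampling strategy in our training code may differ in detail, and the function and variable names are illustrative.

    import numpy as np

    def sample_cells(is_positive, r_max=7, rng=None):
        # is_positive: boolean array over grid cells for one image.
        # Returns indices of the cells that contribute to the loss.
        # r_max = np.inf reproduces training on all cells.
        rng = rng or np.random.default_rng()
        pos = np.flatnonzero(is_positive)
        neg = np.flatnonzero(~is_positive)
        cap = r_max * len(pos)
        if np.isfinite(cap) and len(neg) > cap:
            neg = rng.choice(neg, size=int(cap), replace=False)
        return np.concatenate([pos, neg])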



Algorithm 1 Wireframe Construction
Input: Junctions {p_i}, i = 1, ..., N, with p_i = (x_i, {θ_ik}, k = 1, ..., K_i), and a line heat map h(p)
Output: Wireframe W consisting of a set of junction points P connected by a set of line segments L
 1: Initialize P ← ∅, L ← ∅, V ← 0
 2: Binarize h(p) with threshold ω into M
 3: for t1 ∈ {1, 2, ..., N_r} do
 4:   (i, k1) ← π(t1), d_min ← ∞, z ← 0
 5:   for t2 ∈ {1, 2, ..., N_r} do
 6:     (j, k2) ← π(t2)
 7:     if j ≠ i and p_j on r_ik1 and p_i on r_jk2 then
 8:       if ‖x_i − x_j‖ < d_min then
 9:         d_min ← ‖x_i − x_j‖, z ← t2
10:       end if
11:     end if
12:   end for
13:   if z ≠ 0 then
14:     V(t1, z) ← 1
15:   end if
16: end for
17: for all t1, t2 ∈ {1, 2, ..., N_r}, t1 ≠ t2 do
18:   if V(t1, t2) = 1 and V(t2, t1) = 1 then
19:     (i, k1) ← π(t1), (j, k2) ← π(t2)
20:     P ← P ∪ {p_i, p_j}, L ← L ∪ {(p_i, p_j)}
21:   end if
22: end for
23: for all r_ik not matched to another ray do
24:   Find the intersection q_b of r_ik and the image boundary
25:   if ‖x_i − q_b‖ ≤ 0.05 × m then
26:     P ← P ∪ {p_i, q_b}, L ← L ∪ {(p_i, q_b)}
27:   else
28:     Find the farthest line pixel q_M along r_ik on M
29:     Find all intersections {q_1, ..., q_S} of (p_i, q_M) with segments in L
30:     q_0 ← p_i, q_{S+1} ← q_M
31:     for s ∈ {1, 2, ..., S, S+1} do
32:       if κ(q_{s−1}, q_s) > 0.6 then
33:         P ← P ∪ {q_{s−1}, q_s}
34:         L ← L ∪ {(q_{s−1}, q_s)}
35:       end if
36:     end for
37:   end if
38: end for

Going deeper. It is also interesting to investigate how the network depth of the encoder affects the performance. In this experiment, we compared two different choices based on Google Inception-v2, namely the first layer to "Mixed 3b", and the first layer to "Mixed 4b". Note that the latter has a larger depth and receptive field, at the cost of spatial resolution (30×30). As one can see in Figure 1(b), increasing the depth (i.e., predicting at the "coarser" level) results in higher precision but lower recall. This suggests possibilities to further improve the performance of our method using a "skip-net" architecture, that is, combining predictions at multiple levels. We leave this for future work.

[Figure 1. Experiment on junction detection network parameters. Precision-recall curves: (a) r_max ∈ {1, 7, ∞}; (b) network depth (Mixed-3b vs. Mixed-4b).]
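The depth comparison amounts to truncating the encoder at different intermediate layers. Inception-v2 is not bundled with torchvision, so the sketch below instead cuts the closely related GoogLeNet (Inception-v1) at two analogous points purely to illustrate the resolution/receptive-field trade-off; the layer names and input size are illustrative, not our training configuration.

    import torch
    from torchvision.models import googlenet
    from torchvision.models.feature_extraction import create_feature_extractor

    # Cut the backbone at a shallow and a deep node, then compare the
    # spatial resolution of the resulting feature maps.
    model = googlenet(weights=None)  # weights="DEFAULT" to load pretrained
    encoder = create_feature_extractor(
        model, return_nodes={"inception3b": "shallow", "inception4b": "deep"})

    x = torch.randn(1, 3, 480, 480)
    feats = encoder(x)
    for name, f in feats.items():
        print(name, tuple(f.shape))  # the deeper cut yields a coarser grid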

2.2. Experiment on Line Segment Detection

In this experiment, we study the possibility of extracting line segments directly from the pixel-wise line heat map predicted by our network (i.e., without using junctions). To this end, we simply perform a probabilistic Hough transform [1] on the line heat map to generate line segments. We compare the results with LSD, MCMLSD, and our full wireframe construction method.
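A minimal version of this baseline, assuming OpenCV, looks as follows; the binarization threshold and Hough parameters here are placeholders, not the values used for the curves in Figure 2.

    import cv2
    import numpy as np

    def heatmap_to_segments(h, omega=0.5, min_len=20, max_gap=3):
        # Binarize the line heat map and run the progressive probabilistic
        # Hough transform [1]; returns segments as (x1, y1, x2, y2).
        M = (h >= omega).astype(np.uint8) * 255
        lines = cv2.HoughLinesP(M, rho=1, theta=np.pi / 180, threshold=30,
                                minLineLength=min_len, maxLineGap=max_gap)
        return [] if lines is None else [tuple(l[0]) for l in lines]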

Figure 2 shows the precision-recall curves of all methods. We make the following observations on the results. First, the performance of our "Heatmap + Hough" approach is comparable to that of the state-of-the-art line segment detection method MCMLSD, verifying the effectiveness of our line detection network. Second, by combining the predicted junctions with the line heat map, our full wireframe construction method performs significantly better than using the line heat map alone. This further illustrates the importance of junction detection in parsing the wireframe: by detecting the "endpoints" of the line segments, we effectively overcome the difficulties faced by traditional line segment detection methods, including the false detection problem and the inaccurate endpoint problem.

[Figure 2. Experiment on line segment detection. Precision-recall curves for LSD, MCMLSD, Our Method, and Heatmap+Hough on (a) our test dataset and (b) the York Urban dataset.]

2.3. Additional Results on Junction Detection

In Figure 4, we show additional junction detection results obtained by all methods. One can see that our method is able to detect most junctions and their branches in the image, achieving superior performance over existing methods.

From Figure 4 we can also observe some limitations of our method. Specifically, there are occasionally repeated detections in our results. This may be caused by junctions located at the boundary of two adjacent grid cells used in our junction detection network. Similarly, the use of a grid could also lead to missed detections if two junctions are very close to each other. But we note that such cases are rather uncommon in practice and have a very small effect on the overall scene structure estimation.

2.4. Additional Results on Wireframe Construction

In Figure 5, we show additional wireframe detection results obtained by all methods. Our method outperforms the other two in most areas and produces much cleaner results, as we focus on long line segments and exploit their relations (junctions). Therefore, the resulting wireframes are potentially more suitable for 3D reconstruction tasks.

In Figure 3, we further show some failure cases of our method. One challenging case corresponds to structures with relatively small scale and weak image gradients (e.g., the stairs in the first image). Also, our method sometimes has difficulty in image regions with repetitive patterns (e.g., the handrails in the second image and the brick wall in the third image), generating fragmented, incomplete results. This suggests opportunities for further improvement by explicitly harnessing such geometric structure in our wireframe construction.

[Figure 3. Failure cases on our test dataset. First row: our method. Second row: ground truth.]

References

[1] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using the progressive probabilistic Hough transform. Computer Vision and Image Understanding, 78(1):119-137, 2000.



Figure 4. Junction detection results. First row: MJ (d_max = 20). Second row: ACJ (ε = 1). Third row: our method (τ = 0.5).



Figure 5. Line/wireframe detection results. First row: LSD (−log(NFA) > 0.01 × 1.75^8). Second row: MCMLSD (top 100 segments by confidence). Third row: our method (line heat map h(p) > 10). Fourth row: ground truth.
