
Received August 2, 2019, accepted August 25, 2019, date of publication September 26, 2019, date of current version October 9, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2943885

Wireframe Parsing With Guidance of Distance Map

KUN HUANG 1,2,3, (Student Member, IEEE), AND SHENGHUA GAO 2, (Senior Member, IEEE)
1 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
2 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
3 University of Chinese Academy of Sciences, Beijing 100049, China

Corresponding author: Kun Huang ([email protected])

ABSTRACT We propose an end-to-end method for simultaneously detecting local junctions and the global wireframe in man-made environments. Our pipeline consists of an anchor-free junction detection module, a distance map learning module, and a line segment proposing and verification module. A set of line segments are proposed from the predicted junctions with guidance of the learned distance map, and further verified by the proposal verification module. Experimental results show that our method outperforms the previous state-of-the-art wireframe parser by a decent margin. In terms of line segment detection, our method shows competitive performance on standard benchmarks. The proposed networks are end-to-end trainable and efficient.^a

INDEX TERMS Artificial neural networks, computer vision, feature extraction, image edge detection.

I. INTRODUCTION
Inferring 3D geometric information of a scene from 2D images has been a fundamental yet difficult problem in computer vision. For a long time, researchers have modeled and reconstructed 3D scenes by extracting and matching local features (e.g., SIFT features, corners, and patches). The feasibility of incorporating these local features by matching or tracking has been demonstrated by numerous works. Meanwhile, modern scenarios, which often involve complex interactions of autonomous agents (UAVs, cars, home robots) with cluttered man-made environments (indoor or outdoor), present greater challenges for conventional approaches based on local features. Specifically, the challenges lie in the following: man-made scenes often consist of large areas of textureless surfaces, and there may exist areas of repetitive patterns (e.g., facades) which cause much ambiguity for matching. Therefore, it is crucial for vision systems to capture more global features to analyze scenes more accurately.

To cope with these challenging environments, prior knowledge about the scene is exploited to relax the problem in many works. For instance, the Manhattan world assumption [1]–[5] has proven to benefit 3D reconstruction tasks. However, the Manhattan world assumption is often violated

The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.

^a The code will be released on GitHub for reproduction of the results.

in cluttered man-made environments. Fortunately, independent of the assumption, we can formulate such environments as the ''wireframe'' defined in [6]. The wireframe [6] consists of a subset of salient lines and junctions in the scene. Conceptually, such junctions and lines could just be a subset of all the local corner and edge features extracted by traditional methods, but they encode rich information about the large-scale geometry of the scene, which is the key to understanding scenes more globally.

Detection of line segments [7] and junctions [5], [8] in 2D images has been studied in previous works. The detection results are further used to recover the geometry of the scenes in [2], [4], [9]–[11]. Typically, these methods rely on low-level cues and involve various heuristics, including the search for appropriate thresholds, RANSAC-based verification techniques, etc., which limit their scalability in new scenarios.

With the success of deep learning in computer vision, more and more approaches based on deep learning have been proposed to tackle vision tasks. Specifically, in the literature on line segment and junction detection, several methods have been proposed to detect line segments [12] or estimate the wireframe [6] in an image. In [12], a line attraction field map (AFM) method is proposed to detect line segments by learning an embedding v_dir for each pixel p in the image. The v_dir encodes the direction (with length) to the line segment pixel which is the closest to p. In the post-processing stage, the line segment candidate pixels are found by adding


v_dir to p. These candidate pixels are grouped into line segments using a greedy algorithm. AFM shows better performance than previous state-of-the-art line segment detection methods. However, AFM only detects line segments, ignoring the intersecting relationships between pairs of them, which are important for geometric reasoning.

Prior to AFM, the Wireframe Parser [6] formulates wireframe parsing as finding the salient line segments {l_i} and the intersections between these segments, i.e., the junctions {p_i}, in the images. The concept of a wireframe is more complete than a pure set of line segments, and the wireframe obviously encodes richer geometric information. The parser network in [6] consists of two sub-networks for junction and line detection, respectively. The junction sub-network divides the image into a K × K grid, and detects at most one junction in each grid cell by regressing the confidence of the junction cell. Besides, the residual of the junction off the grid cell center is regressed at the same time. The line detection sub-network estimates the image pixels which lie on long straight lines. Finally, the detected junctions and estimated line pixels are combined into a holistic wireframe with a post-processing procedure. The Wireframe Parser captures some important geometric information of the scene. However, there are several drawbacks that limit the performance of the parser:

• The junction decoder predicts a fixed number of junctions (i.e., 1) in each grid cell. Considering the uneven distribution of junctions in scenes, the assumption that there exists only one junction in a single grid cell does not apply to scenes with dense junctions, such as images of facades or urban street views.

• The process of proposing line segments from junctions suffers from inaccurate junction predictions. In particular, inaccurate junction branches make it difficult to pair junctions.

• The proposed line segments are not further verified toreject false connections between junctions.

Based on the above discussion of previous works on line segment detection and wireframe parsing, we propose an end-to-end network for parsing the wireframe of man-made environments. Overall, our network contains two stages. In the first stage, the network learns three per-pixel embedding maps, including a junction heatmap, a junction branch map, and a distance map. Then, in the second stage, a set of line segment proposals are generated from the junctions with guidance of the distance map, and further verified through a verification network.

As depicted in Fig. 2, with an image fed into the network, the junction detection module outputs a junction heatmap (heatmap decoder) and a junction branch map (branch decoder). Junctions are obtained by picking the locations of high density over the junction heatmap; thus our network detects junctions in an anchor-free manner, and is able to detect an unlimited number of junctions. Meanwhile, the distance decoder outputs a distance map. Examples of the distance map are shown in Fig. 1.

FIGURE 1. Second column: the pre-computed ground-truth for distance map learning; third column: the converted distance map. We set an array of thresholds {b_i, i = 1, 2, . . . , 5} for the log-normalized distance value, in which b_1 = 0, b_5 = ∞. If the distance value d_p^n of pixel p falls into the i-th bin [b_i, b_{i+1}), it is visualized with the i-th color.

Then we propose a set of line segments from the predicted junctions using our Least Distance algorithm, guided by the learned distance map. Finally, the proposals are passed to a verification module and each is assigned a confidence score.

Contributions of this work include: (i) We propose an end-to-end learning network for wireframe parsing in images of man-made environments, consisting of an anchor-free junction detection module, a distance learning module and a line segment proposing and verification module. (ii) Our method outperforms the previous wireframe parser by a decent margin, and is more efficient. (iii) Compared with the previous wireframe parser, our method produces much cleaner wireframe results with guidance of the learned distance map. (iv) In terms of line segment detection, our method is close to the state-of-the-art line segment detection method (AFM [12]).

II. RELATED WORK
A. JUNCTION DETECTION AND KEYPOINT DETECTION
Junctions play an important role in computer vision; however, detecting junctions in images remains a challenging problem. Typical local methods, such as the Harris operator [14], are based on 2D variation in the intensity signal and are weak at handling textured regions. More recent methods find junctions in natural images [15] by studying contour curvature. Some others group line segments [5], [8], [9] to find junctions (i.e., the intersections of segments). Psychophysical analysis suggests that local junctions are difficult to detect, even for humans [16].

Keypoint detection is closely related to some reconstruction tasks, such as the reconstruction of scenes, faces and human poses. Many recent approaches for keypoint estimation apply deep neural networks (DNNs). Earlier works formulate keypoint estimation as a location (keypoint coordinates) regression problem [17]. In later works, keypoint detection methods mostly rely on regressing a pre-computed keypoint heatmap [18], [19].


FIGURE 2. The overall architecture of our end-to-end Wireframe parser. After the first few layers, the input image is downsampled to H/8 × W/8. F_n is the mid-layer feature generated by the n-th hourglass module, and the number of channels of all F_n is 256 in our experiments. We use 3 hourglass modules in total, i.e., n = 3. The decoders of distance, heatmap and branch are sets of convolution layers with the same layer configuration. In the proposal verification module, we perform RoI-Align [13] on F_3.

The heatmap can be easily converted to keypoint coordinates by searching for the local maxima. Junctions can be naturally considered geometric keypoints. Huang et al. [6] train an anchor-based junction detector using deep neural networks on a large-scale dataset with junction annotations, and achieve state-of-the-art results. Recently, Zhou et al. [20] propose to find junctions in an end-to-end manner.

B. PIXEL-LEVEL EDGE DETECTION AND LINE SEGMENT DETECTION
Line segment detection has long been studied. Conventional methods typically rely on grouping low-level cues [7], [21]–[24]. These approaches suffer the pain of searching for an appropriate threshold to filter out false detections. Some other works extend the Hough transform to line segment detection [25]–[27]. In recent years, machine learning, especially deep learning, based approaches have been shown to produce state-of-the-art results in generating pixel-wise edge maps [28]–[31]. Huang et al. propose to parse wireframes in man-made scenes. The wireframe defined in [6] consists of a subset of salient line segments and the intersections between them in scenes. They also build a large-scale dataset for training, in which each image is annotated with line segments and their intersecting relationships. Following the Wireframe Parser, Xue et al. propose a line attraction field network [12] for line segment detection and achieve state-of-the-art line segment detection results. Unlike the Wireframe Parser, they only focus on the detection of line segments, ignoring the relationships between the segments. Zhang et al. [32] formulate the wireframe parsing task as a graph optimization problem.

C. LEARNING THE GEOMETRY
With the success of deep learning, learning-based approaches for inferring pixel-level geometric properties of scenes have been developed for several years. Geometric properties such as depth [33], [34] and surface normals [35], [36] have been extensively studied. Recently, more and more work pays attention to higher-level geometric primitives.

For instance, [37] proposes a method to recognize planes in a single image, [38] uses an SVM to classify indoor planes, and [39]–[41] train fully convolutional networks to predict room layout edges formed by the pairwise intersections of room faces. Liu et al. propose PlaneNet [42] to detect piecewise planar regions from a single image. Yang and Zhou [43] recover planar regions with deep networks.

D. EMBEDDING LEARNING FOR DETECTION
Embedding learning is frequently used in instance segmentation. Usually the embedding is learned by regressing a pre-computed embedding map generated from the ground-truth annotations, and it acts as the cue to cluster pixels or group keypoints into instances. For example, Liang et al. use proposal-free networks (PFN) [44] to handle the instance segmentation task. PFN takes as the target embedding the coordinates of the center, top-left corner and bottom-right corner of the object instance that a specific pixel belongs to, which is used to cluster segmentation results into object instances. Bai and Urtasun [45] apply deep networks to learn a distance embedding, which measures how close a pixel in an instance object is to the boundary of the instance. The network outputs an embedding map with peaks and basins. The watershed transform is then used in post-processing to find the basins, i.e., the instance boundaries. Papandreou et al. [46] propose to learn a pose keypoint heatmap along with the displacement of each keypoint to the other keypoints. Recently, Xue et al. [12] propose line attraction field networks to learn an embedding for line segment detection. The line attraction embedding encodes the direction (with length) from each pixel to the closest line segment pixel, and is post-processed with the squeeze module to find line segments.

III. METHOD
For clarity of notation, we let p and x denote an image pixel and a junction, respectively, and we denote a line segment by l. In addition, superscripts 'g' and 'p' represent ground-truth and predictions, respectively. In terms of distance functions,


d(x_i, x_j) is the euclidean distance between two junctions, and d⊥(x, l) defines the perpendicular distance of junction x to line segment l, while d(p, l) denotes the shortest distance of pixel p to line segment l, as illustrated in Fig. 3. The projection of junction x on line segment l is denoted by ρ(x, l). In the distance map, we use d_p to represent the shortest distance of pixel p to any line segment pixel in the image, and d_p^n to denote the log-normalized distance value.

FIGURE 3. The shortest distance of a point x to a line segment l = (x_s, x_e). The projection of the point on the line segment is (a) inside the line: d(x, l) = d⊥(x, l), the perpendicular distance of x from l; (b) outside the line: d(x, l) = min(d(x, x_s), d(x, x_e)), i.e., the smaller euclidean distance from x to x_s or x_e.
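To make the notation concrete, below is a minimal NumPy sketch of the point-to-segment distance illustrated in Fig. 3; the function name and the array representation of points are our own illustrative choices, not the authors' code.

```python
import numpy as np

def point_segment_distance(x, xs, xe):
    """Shortest distance d(x, l) from point x to segment l = (xs, xe):
    the perpendicular distance if the projection falls inside the
    segment, otherwise the distance to the nearer endpoint (Fig. 3)."""
    x, xs, xe = map(np.asarray, (x, xs, xe))
    seg = xe - xs
    seg_len2 = np.dot(seg, seg)
    if seg_len2 == 0.0:                 # degenerate segment
        return np.linalg.norm(x - xs)
    # Parameter t of the projection rho(x, l) = xs + t * (xe - xs)
    t = np.dot(x - xs, seg) / seg_len2
    if 0.0 <= t <= 1.0:                 # case (a): projection inside
        return np.linalg.norm(x - (xs + t * seg))
    # case (b): projection outside, take the nearer endpoint
    return min(np.linalg.norm(x - xs), np.linalg.norm(x - xe))
```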

A. ANCHOR-FREE JUNCTION DETECTION
Junctions play an important role in our line segment proposing algorithm. We follow the definition and representation in [6]. However, we make some modifications to render the representation more convenient.

1) JUNCTION HEATMAP REGRESSION
In [6], the junction sub-network divides the image into a K × K grid, and detects at most one junction in each grid cell. The cell centers act as a kind of anchor, as commonly used in object detection methods [47], and the use of anchors is based on the assumption that there exists at most one junction in each cell. However, the assumption does not apply to some scenes with dense junctions.

As mentioned in Section II, heatmap regression has become a common practice in keypoint detection tasks, such as human pose estimation [18]. By applying a heatmap, estimation of the keypoint coordinates is transformed into spatial density regression. Inspired by this, we detect junctions by first regressing a junction heatmap. The ground-truth junction heatmap is generated by placing a Gaussian probability distribution with a radius of 11 (the image size is 384) centered at each junction x. As shown in Fig. 2, the junction heatmap decoder outputs an H × W junction heatmap. Then locations with locally maximum density on the learned heatmap are extracted as junction candidates through non-maximum suppression (NMS).^1 After NMS, we obtain a set of junction candidates {(x_i, c_i), i = 1, 2, . . . , N_max}, where N_max is the maximum number (a pre-set value) of junctions in an image and c_i is the heatmap density value at x_i. Overall, the proposed junction detection method shows more flexibility, without relying on any anchors or assumptions.

^1 Here, the NMS is a simple max-pooling operation plus a filtering of non-maximum values.
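As a rough illustration of this NMS step, the following PyTorch sketch implements it exactly as the footnote describes (max-pooling plus filtering of non-maximum values), followed by a top-N_max selection; the kernel size, the default τ_J, and the tensor layout are our assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def heatmap_nms(heatmap, kernel=3, tau_j=0.5, n_max=512):
    """heatmap: tensor of shape (1, H, W) with values in [0, 1].
    Returns (coords, scores) for up to n_max junction candidates."""
    pad = kernel // 2
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel,
                          stride=1, padding=pad).squeeze(0)
    keep = (heatmap == pooled).float()       # keep only local maxima
    suppressed = heatmap * keep
    scores, idx = suppressed.view(-1).topk(n_max)
    mask = scores > tau_j                    # confidence threshold tau_J
    w = heatmap.shape[-1]
    ys, xs = idx[mask] // w, idx[mask] % w
    return torch.stack([xs, ys], dim=-1), scores[mask]
```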

FIGURE 4. The representation of junction branches, adopted from [6]. If a branch falls into the k-th bin, then it is represented with (k, Δ_k). Δ_k is the offset of the branch from the bin center (the black dashed line) in the clockwise direction. Note that Δ_k is normalized to [−1, 1).

2) JUNCTION BRANCH DEFINITION AND REPRESENTATION
For junction branches, we adopt the same multi-bin representation proposed in [6]. As depicted in Fig. 4, the circle (i.e., from 0 to 360 degrees) is divided into K equal bins, with each bin spanning 360/K degrees. Let the center of the k-th bin be b_k; then an angle θ is represented as (k, Δ_k) if θ is located in the k-th bin, where Δ_k is the angle offset from the center b_k in the clockwise direction. Thus, each valid bin (i.e., one that contains a junction branch) regresses to this local orientation Δ_k. We normalize Δ_k to be in [−1, 1) in all our experiments. For each predicted junction location (from the heatmap), the branch decoder in Fig. 2 outputs a 2K vector, i.e., a confidence score and an angle offset for each of the K bins.
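A small sketch of this multi-bin encoding may help; the default K below is a placeholder (the paper does not fix it here), and the offset convention is simplified to a scalar angle in [0, 360).

```python
import numpy as np

def encode_branch(theta_deg, K=15):
    """Encode a branch angle (degrees in [0, 360)) into the multi-bin
    representation (k, Delta_k) of Fig. 4, with Delta_k in [-1, 1).
    K=15 is a hypothetical default, not the paper's stated value."""
    bin_width = 360.0 / K
    k = int(theta_deg // bin_width)                   # bin index
    center = (k + 0.5) * bin_width                    # bin center b_k
    delta = (theta_deg - center) / (bin_width / 2.0)  # normalize to [-1, 1)
    return k, delta

def decode_branch(k, delta, K=15):
    """Inverse mapping: recover the angle from (k, Delta_k)."""
    bin_width = 360.0 / K
    return (k + 0.5) * bin_width + delta * (bin_width / 2.0)
```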

However, the junction branches defined in [6] are incomplete for proposing line segments. Geometrically, a junction is the intersection of n (n ≥ 2) line segments; that is, each junction should have at least two branches. However, the endpoints of some line segments lie on the image boundary (the line is cut off by the image boundary) or on curved lines. Neither case is considered in the previous definition. Hence we extend the junction definition to the union of the intersections of line segments and the isolated endpoints of line segments (e.g., on the image boundary or on non-straight lines). Then each junction contains n (n ≥ 1) branches. This definition is more flexible, and more suitable for finding line segments from junctions.

In [6], junction locations and branches are predicted separately; we similarly use two decoders to predict the junction heatmap (1 × H × W) and the branch map (2K × H × W), respectively.

3) MATCHING KEYPOINTS WITH GROUND-TRUTH JUNCTIONS
A set of junction candidates {(x_i, c_i), i = 1, 2, . . . , N_max} are extracted from the junction heatmap, which is the output of the heatmap decoder in Fig. 2. Those with confidence scores above a threshold τ_J are kept as junction predictions. For a junction prediction x_i^p, we first find the ground-truth junction x_j^g which is closest to x_i^p. If d(x_i^p, x_j^g) < τ_M, where τ_M is a threshold, then we assign the branches of x_j^g to x_i^p as its branch map regression target. Intuitively, τ_M is set to half the size of the Gaussian kernel used in generating the ground-truth junction map.
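This matching step can be sketched as a nearest-neighbor assignment under the threshold τ_M; the vectorized NumPy form below is our own illustration.

```python
import numpy as np

def match_junctions(pred, gt, tau_m):
    """For each predicted junction, find the closest ground-truth one;
    match if d(x^p, x^g) < tau_M. pred: (N, 2), gt: (M, 2) arrays.
    Returns f: pred index -> gt index, or -1 if unmatched."""
    f = np.full(len(pred), -1, dtype=int)
    if len(gt) == 0:
        return f
    # Pairwise euclidean distances between predictions and ground truth
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    matched = d[np.arange(len(pred)), nearest] < tau_m
    f[matched] = nearest[matched]
    return f
```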


B. DISTANCE MAP LEARNING
The shortest distance of a point to a line segment is depicted in Fig. 3. For an image, the distance map D is computed by choosing the smallest among all distances {d(p, l_i), i = 1, 2, . . . , N_L} for each pixel p, where N_L is the total number of ground-truth line segments in the image. As mentioned previously, the embedding learned in instance segmentation is exploited to group pixels into instances in post-processing. In our case, the distance map is designed to filter out non-line textures and encodes a value for each pixel. Examples of ground-truth distance maps are presented in Fig. 1. We use the log-normalized distance d_p^n = log(d_p + 1) as the target for distance map learning, since d_p can be very large, and d_p^n ∈ [0, log(√(H² + W²) + 1)].
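A direct (unoptimized) sketch of this ground-truth computation, under the assumption that segments are given as endpoint tuples in pixel coordinates; a real implementation would vectorize or rasterize more aggressively.

```python
import numpy as np

def distance_map_gt(segments, H, W):
    """For every pixel p, compute the shortest distance d_p to any
    ground-truth segment, then log-normalize: d_p^n = log(d_p + 1).
    segments: list of (x1, y1, x2, y2) tuples. O(H * W * N_L)."""
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    d = np.full((H, W), np.inf)
    for (x1, y1, x2, y2) in segments:
        a = np.array([x1, y1], dtype=np.float64)
        b = np.array([x2, y2], dtype=np.float64)
        ab = b - a
        # Clipping t to [0, 1] handles the "projection outside" case
        t = np.clip(((pix - a) @ ab) / max(ab @ ab, 1e-9), 0.0, 1.0)
        proj = a + t[..., None] * ab
        d = np.minimum(d, np.linalg.norm(pix - proj, axis=-1))
    return np.log(d + 1.0)            # log-normalized target d_p^n
```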

C. PROPOSING LINE SEGMENTS FROM JUNCTIONS GUIDED BY DISTANCE MAP
For an image I, our line segment proposing algorithm takes as input the distance map D and the predicted junctions X = {(x_i^p, θ_i1, . . . , θ_iK_i), i = 1, 2, . . . , N} (θ_ik is the k-th branch of junction x_i^p). Instead of striving to pair junctions according to their branches as in [6], we exploit our Least Distance algorithm to generate a set of line segment proposals P = {(x_i^p, x_j^p)}.

Let r_ik denote the ray formed by junction x_i and branch θ_ik, and ψ(x_i, x_j, r_ik) be the angle difference between the vector from x_i to x_j and the ray r_ik. The Least Distance algorithm takes the following steps (a sketch follows the list).
• At first, we initialize an empty candidate set C_i and an empty proposal set P. C_i stores the line segment candidates starting from x_i^p.
• For the k-th branch θ_ik ∈ [0, 360) of x_i^p, and for each of the other N − 1 junctions, if ψ(x_i^p, x_j^p, r_ik) < τ_P, we add the line (x_i^p, x_j^p) to C_i.
• For a single candidate in C_i, we extract a fixed-length vector of distance values from D using our modified version of the RoI-Align module [13]. Specifically, for candidate (x_m^p, x_n^p), we sample the distance values at S equidistant sampling points from x_m^p to x_n^p, and sum up the sampled values.
• Among all candidates in C_i, we choose the one x_*^p with the least summed distance value (least distance indicates higher confidence of a line segment), and add (x_i^p, x_*^p) to P.
• P is the set of line proposals.

Overall, our Least Distance algorithm proposes O(N) line segments for an image with N junctions. Note that the sampling of distance values has an efficient GPU implementation.
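The sketch below restates the Least Distance algorithm in plain NumPy; it substitutes nearest-neighbor lookups for the paper's modified RoI-Align sampling and uses placeholder thresholds, so it illustrates the control flow rather than the exact implementation.

```python
import numpy as np

def least_distance_proposals(junctions, branches, dist_map, tau_p=10.0, S=32):
    """junctions: (N, 2) array of (x, y); branches[i]: list of branch
    angles in degrees for junction i; dist_map: (H, W). For each branch
    ray, keep the candidate junction with the least summed distance."""
    H, W = dist_map.shape
    proposals = []
    for i, xi in enumerate(junctions):
        for theta in branches[i]:
            best, best_score = None, np.inf
            for j, xj in enumerate(junctions):
                if j == i:
                    continue
                v = xj - xi
                ang = np.degrees(np.arctan2(v[1], v[0])) % 360.0
                diff = abs((ang - theta + 180.0) % 360.0 - 180.0)
                if diff >= tau_p:          # psi(x_i, x_j, r_ik) < tau_P
                    continue
                # Sum distance-map values at S equidistant points
                ts = np.linspace(0.0, 1.0, S)[:, None]
                pts = xi + ts * v
                cols = np.clip(pts[:, 0].round().astype(int), 0, W - 1)
                rows = np.clip(pts[:, 1].round().astype(int), 0, H - 1)
                score = dist_map[rows, cols].sum()
                if score < best_score:
                    best, best_score = j, score
            if best is not None:           # least-distance candidate wins
                proposals.append((i, best))
    return proposals
```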

D. PROPOSAL VERIFICATION NETWORK
1) ASSIGNING LABELS TO PROPOSALS
For an image I, we obtain a set of proposals P = {(x_i^p, x_j^p)} with the Least Distance algorithm. To further verify these proposals, we match the proposed line segments with ground-truth line segments and assign each proposal a label 0 (not matched) or 1 (matched).

Let ϕ denote the angle difference between two lines, ∀ l_i, l_j, ϕ(l_i, l_j) ∈ [0, 180). We match a proposal l^p = (x_i^p, x_j^p) in P with a ground-truth line segment l^g = (x_k1^g, x_k2^g) if either of the two conditions below is satisfied.

(1) Condition 1: d(x_i^p, l^g) < τ_d and d(x_j^p, l^g) < τ_d.
(2) Condition 2: ϕ(l^g, l^p) < τ_θ,
    d⊥(x_i^p, l^g) < τ_d, d⊥(x_j^p, l^g) < τ_d,
    ρ(x_i^p, l^g) = t · x_k1^g + (1 − t) · x_k2^g with t ∈ [−τ⊥, 1 + τ⊥], and the same requirement for x_j^p,

where τ_d, τ_θ and τ⊥ are pre-set thresholds. As mentioned earlier, ρ(x_i^p, l^g) is the projection of junction x_i^p onto line segment l^g. For a proposal l^p = (x_i^p, x_j^p), if there exists a ground-truth line segment l^g satisfying condition 1 or condition 2, then l^p is assigned the label 1, otherwise 0.

Condition 1 covers most cases, while condition 2 handles the situation where a ground-truth line segment was not annotated with an accurate length (a little shorter or longer). The tolerance threshold τ⊥ is set to 0.2 in all our experiments.
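For concreteness, a hedged sketch of the two matching conditions; the threshold values are placeholders, and the projection is parameterized from x_k1^g rather than the paper's convex combination, which is equivalent up to orientation.

```python
import numpy as np

def matches_gt(lp, lg, tau_d=5.0, tau_theta=5.0, tau_perp=0.2):
    """Check whether proposal lp = (xi, xj) matches ground-truth
    segment lg = (xk1, xk2) under condition 1 or condition 2."""
    xi, xj = map(np.asarray, lp)
    a, b = map(np.asarray, lg)
    seg = b - a
    len2 = max(seg @ seg, 1e-9)

    def seg_dist(x):                 # shortest distance d(x, l^g)
        t = np.clip((x - a) @ seg / len2, 0.0, 1.0)
        return np.linalg.norm(x - (a + t * seg))

    def perp_and_t(x):               # perpendicular distance and t
        t = (x - a) @ seg / len2
        return np.linalg.norm(x - (a + t * seg)), t

    # Condition 1: both endpoints close to the ground-truth segment
    if seg_dist(xi) < tau_d and seg_dist(xj) < tau_d:
        return True
    # Condition 2: similar direction, small perpendicular distances,
    # projections inside the extended range [-tau_perp, 1 + tau_perp]
    v = xj - xi
    cos_ang = abs(v @ seg) / (np.linalg.norm(v) * np.sqrt(len2) + 1e-9)
    ang = np.degrees(np.arccos(np.clip(cos_ang, 0.0, 1.0)))
    if ang >= tau_theta:
        return False
    for x in (xi, xj):
        dperp, t = perp_and_t(x)
        if dperp >= tau_d or not (-tau_perp <= t <= 1.0 + tau_perp):
            return False
    return True
```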

2) PROPOSAL VERIFICATION
In the proposal verification module, features for all the line segment proposals are extracted by performing the modified RoI-Align on F_n (the feature layer after the n-th hourglass module) and fed into a proposal classification network, which outputs a confidence score for each proposal. For the RoI-Align, we first sample S equidistant points from the starting point to the ending point of a line segment, and sample values on the feature map F_n (shown in Fig. 2) at those sample points. Note that the original RoI-Align module [13] is intended for extracting features of bounding box areas; hence we modify the module to extract features for line segments.

To align the features with the line segment, our modified RoI-Align module samples feature values by bilinear interpolation to obtain sub-pixel features. Assuming the number of channels of F_n is C, the extracted feature map for each line segment proposal is of shape S × C.
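One plausible realization of this line-segment RoI-Align uses torch.nn.functional.grid_sample for the bilinear interpolation; the function below is our sketch, assuming endpoints are given as a float tensor in pixel coordinates.

```python
import torch
import torch.nn.functional as F

def line_roi_align(feat, endpoints, S=32):
    """feat: (C, H, W) feature map F_n; endpoints: (N, 4) float tensor
    of (x1, y1, x2, y2) in pixels. Bilinearly samples S equidistant
    points per segment and returns features of shape (N, S, C)."""
    C, H, W = feat.shape
    n = endpoints.shape[0]
    ts = torch.linspace(0, 1, S, device=feat.device).view(1, S, 1)
    p1, p2 = endpoints[:, None, :2], endpoints[:, None, 2:]
    pts = p1 + ts * (p2 - p1)                    # (N, S, 2) sample points
    # Normalize coordinates to [-1, 1] as grid_sample expects
    grid = pts.clone()
    grid[..., 0] = 2.0 * pts[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * pts[..., 1] / (H - 1) - 1.0
    sampled = F.grid_sample(feat.unsqueeze(0).expand(n, -1, -1, -1),
                            grid.unsqueeze(2), align_corners=True)
    return sampled.squeeze(-1).permute(0, 2, 1)  # (N, S, C)
```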

E. NETWORK ARCHITECTURE
Our network follows a typical encoder-decoder design, as shown in Fig. 2. The first few layers are convolution and pooling layers, which downsample the input to 1/8 of the original resolution. Then n (n = 3 in our experiments) hourglass modules [18] are stacked to obtain the feature maps {F_i, i = 1, 2, . . . , n}. The following decoders can take any F_i as input, and we use F_3 as the input for the decoders, as shown in Fig. 2. Each decoder consists of a set of convolution and upsampling layers, followed by a 1 × 1 convolution layer to produce the desired embedding maps. For simplicity, all the decoders share the same layer configuration. To classify the proposals, the proposal verification module consists of two convolution layers and one fully connected layer. Note that all convolution layers are followed by a batch normalization layer.
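As an illustration only, a decoder matching this description could be assembled as below; the layer counts and channel widths are assumptions, since the paper does not spell out the exact configuration.

```python
import torch.nn as nn

def make_decoder(in_ch=256, out_ch=1, up_factor=8):
    """Hypothetical decoder: convolution + upsampling stages followed
    by a 1x1 convolution, with batch norm after every convolution."""
    layers, ch = [], in_ch
    while up_factor > 1:
        layers += [nn.Conv2d(ch, ch // 2, 3, padding=1),
                   nn.BatchNorm2d(ch // 2), nn.ReLU(inplace=True),
                   nn.Upsample(scale_factor=2, mode='bilinear',
                               align_corners=False)]
        ch //= 2
        up_factor //= 2
    layers.append(nn.Conv2d(ch, out_ch, 1))
    return nn.Sequential(*layers)

# e.g., heatmap decoder: make_decoder(256, 1); branch decoder: make_decoder(256, 2 * K)
```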


FIGURE 5. The precision-recall curves of different line segment detection methods. Left: the Wireframe dataset [6]. Right: the York Urban dataset [49].

F. LOSS FUNCTION
Our loss function consists of the regression loss of the junction heatmap L_Jh, the distance map loss L_D, the junction branch confidence loss L_Jbc, the junction branch residual (offset) loss L_Jbr, and the line segment proposal confidence loss L_Lprop. We use the binary cross entropy (BCE) loss for L_Jh, i.e.,

L_Jh = BCE(J^p, J^g)    (1)

where J^p and J^g refer to the predicted and ground-truth junction heatmaps, respectively. Note that we adopt the weighting scheme of the focal loss [48] to combat the imbalance of zero and non-zero values on the heatmap. And

L_D = Σ_{p∈I} (1 / d_p^p) ||d_p^p − d_p^g||_2^2    (2)

where we weight the loss of each pixel p with 1/d_p^p, since pixels far from lines would result in a much larger loss if not weighted.

For junction branches, as mentioned above, we first match every predicted junction x_i^p ∈ X to the ground-truth junctions G = {x_i^g}. We define a mapping function f(i) to represent the index of the ground-truth junction which is matched with the predicted junction i. c_i and r_i are both K-dimensional vectors (K is the number of bins) representing the branch confidence and residual of junction x_i, respectively. Assume there are N junction predictions in X; then

L_Jbc = Σ_{i=1..N} BCE(c_i^p, c_{f(i)}^g)    (3)

L_Jbr = Σ_{i=1..N} ||r_i^p − r_{f(i)}^g||_2^2    (4)

As mentioned in Section III-D, each line segment proposal l_k^p is assigned a label g_k. Let cls(l_k^p) denote the confidence score output by the proposal classification network; then

L_Lprop = Σ_{k=1..N_prop} BCE(g_k, cls(l_k^p))    (5)

TABLE 1. Comparison of inference speed.

L_Jh and L_D are normalized by the number of pixels in I, L_Jbc and L_Jbr are normalized by the number of matched junction predictions, and L_Lprop is normalized by the total number of line segment proposals. Then our loss function is

L = w_J L_Jh + w_D L_D + w_bc L_Jbc + w_br L_Jbr + w_prop L_Lprop    (6)

where w_J, w_D, w_bc, w_br, w_prop are weights to balance the sub-losses.
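Putting the pieces together, a schematic PyTorch version of Eq. (6) might look as follows; the dictionary keys are hypothetical, predictions are assumed to be post-sigmoid, and the focal-loss weighting of Eq. (1) and the matching step behind Eqs. (3)-(4) are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, gt, w):
    """pred/gt: dicts of tensors (names are our placeholders);
    w: dict of the balancing weights w_J, w_D, w_bc, w_br, w_prop."""
    n_pix = gt['heatmap'].numel()
    l_jh = F.binary_cross_entropy(pred['heatmap'], gt['heatmap'],
                                  reduction='sum') / n_pix          # Eq. (1)
    # Eq. (2): per-pixel weighting by 1 / d_p^p (clamped for stability)
    l_d = (((pred['dist'] - gt['dist']) ** 2)
           / pred['dist'].clamp(min=1e-3)).sum() / n_pix
    n_match = max(gt['branch_conf'].shape[0], 1)
    l_bc = F.binary_cross_entropy(pred['branch_conf'], gt['branch_conf'],
                                  reduction='sum') / n_match        # Eq. (3)
    l_br = ((pred['branch_res'] - gt['branch_res']) ** 2
            ).sum() / n_match                                       # Eq. (4)
    n_prop = max(gt['prop_label'].shape[0], 1)
    l_prop = F.binary_cross_entropy(pred['prop_score'], gt['prop_label'],
                                    reduction='sum') / n_prop       # Eq. (5)
    return (w['J'] * l_jh + w['D'] * l_d + w['bc'] * l_bc +
            w['br'] * l_br + w['prop'] * l_prop)                    # Eq. (6)
```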

G. TRAINING
In the training and testing phases of our network, images are resized to 384 × 384 before being fed in, and we follow the data augmentation strategy in [6], including mirroring and flipping images upside-down. The training of our network consists of two stages. In the first stage, we set w_prop to zero for the first 25 epochs, since there are not enough proposals for classification in the early stage of training. In the second stage, we resume the training of L_Lprop and train for 5 more epochs.^2 We train our network with the stochastic gradient descent (SGD) optimizer with momentum 0.9, an initial learning rate of 0.25,^3 and a batch size of 4.

IV. EXPERIMENTS
We evaluate the proposed method and make a comparison with the previous state-of-the-art wireframe parser [6].^4 We also compare our line detection results with the state-of-the-art line detectors, namely the Line Segment Detector (LSD) [7],^5 the Markov Chain Marginal Line Segment Detector (MCMLSD) [50],^6 and the recent Line Attraction Field (AFM) [12].^7 We test and compare these methods on the Wireframe dataset [6] and the York Urban dataset [49].

FIGURE 6. Wireframe detection results. From left to right: LSD (−log(NFA) > 0.01 × 1.75^8), MCMLSD (top 90), AFM (aspect ratio < 0.2), wireframe parser (heatmap > 10), our method (confidence > 0.3), ground-truth.

^2 Weights of the other losses are not changed.
^3 Note that our learning rate is large because the scale of our loss is small.
^4 https://github.com/huangkuns/wireframe
^5 http://www.ipol.im/pub/art/2012/gjmr-lsd/
^6 http://www.elderlab.yorku.ca/resources/
^7 We use the results pre-computed by the authors here.


A. DATASETS
The Wireframe dataset [6] contains 5462 images of indoor and outdoor man-made environments. Each image is annotated with a set of junctions and lines. We follow the split in [6], i.e., 5000 images for training and validation, and 462 images for testing. On the York Urban dataset, we simply use the model trained on the Wireframe dataset to evaluate the performance.

B. EVALUATION METRICS
In all our experiments, the precision and recall as described in [6], [51] are reported. Specifically, precision and recall are computed by comparing the line segment pixel maps of the prediction and the ground-truth. The precision measures the proportion of true detections among all the detected line segments, whereas the recall indicates the percentage of true detections among all the ground-truth line segments in the image.

C. COMPARISONS WITH OTHER METHODS
The proposed method is compared with the previous Wireframe Parser [6], the Line Segment Detector (LSD) [7], the Markov Chain Marginal Line Segment Detector (MCMLSD) [50], and the Line Attraction Field (AFM) [12].

These methods filter out false detections with different thresholds. We set an array of thresholds for −log(NFA) (NFA is the number of false alarms) in the LSD implementation, i.e., 0.01 × {1.75^0, . . . , 1.75^19}. For MCMLSD, we select the top K (in terms of confidence) line segment detections for comparison. For AFM, the aspect ratio is used to filter out false detections, and it varies in the range (0, 1] with a step size of 0.1. The Wireframe Parser rejects false detections by applying an array of thresholds [2, 6, 10, 20, 30, 50, 80, 100, 150, 200, 250, 255] to binarize the line heatmap, while keeping the thresholds of junction confidence and junction branch confidence fixed. In the proposed method, we simply vary the threshold on the confidence of line segment detections in the range (0, 1] with a step size of 0.1 to pick true detections.

The precision/recall curves are presented in Fig. 5. As shown, our method outperforms the previous Wireframe Parser by a decent margin on the Wireframe dataset, and is close to the Wireframe Parser on the York Urban dataset. It is worth noting that the York Urban dataset is a small dataset aimed at Manhattan line estimation, and some line segments in its images are not annotated. Hence the performance of almost all approaches drops significantly on this dataset. Compared with the state-of-the-art line segment detectors, our method outperforms LSD and MCMLSD, and is close to the recent AFM on the Wireframe dataset.

All our experiments are conducted on one NVIDIA Titan X GPU device. We compare the inference speed of all the methods mentioned above in Table 1. As shown, our method is much faster than the previous Wireframe Parser [6], since it does not require any post-processing step, and it runs as quickly as the AFM-unet [12].

D. QUALITATIVE EVALUATION OF RESULTS
The results of all the aforementioned methods are visualized in Fig. 6. For the proposed method and the Wireframe Parser, the junctions (yellow) and line segments (green) are displayed together. Since the line segment detection methods do not output junctions, we simply treat the endpoints of the line segments as junctions and present these ''junctions'' together with the line segments.

As shown, LSD [7] and MCMLSD [50] are sensitive to local textures, and produce many short line segments. AFM [12] greatly alleviates this limitation by suppressing the interference of local textures. However, AFM still yields some short line segments. The Wireframe Parser further reduces the effect of local textures by defining junctions explicitly and then establishing connections between junctions. Unfortunately, some wrong connections are produced due to the lack of means to reject them. Compared with the Wireframe Parser [6], our method produces a much cleaner wireframe. Specifically, our method works better at suppressing wrong connections between junctions in two aspects. On the one hand, the learned distance map guides the establishment of connections between the detected junctions, according to the Least Distance algorithm. On the other hand, the line proposal verification module explicitly suppresses wrong line proposals.

In terms of line segment detection, our method is close toAFM [12], the state-of-the-art line segment detector. Differ-ent from AFM, our network yields both the line segments andthe intersections between pairs of them.

V. CONCLUSION
In this paper, we propose an end-to-end wireframe parsing method based on the previous state-of-the-art wireframe parser. The proposed parser consists of an anchor-free junction detection module, a distance map learning module and a line segment proposing and verification module. Line segments are proposed from junctions with guidance of the distance map, and further verified by the verification module. The network is end-to-end trainable and efficient. Experimental results show that our method outperforms the previous wireframe parser, and is close to the recent AFM [12] in terms of line segment detection. There is still plenty of room for improvement, and the end-to-end pipeline can be further optimized. Considering the difficulty of the task and the great variety in the dataset, one future direction is to generate large-scale synthetic wireframe data to help train better models.

REFERENCES
[1] E. Delage, H. Lee, and A. Y. Ng, "Automatic single-image 3D reconstructions of indoor Manhattan world scenes," in Proc. Int. Symp. Robot. Res., 2005, pp. 305–321.
[2] D. C. Lee, M. Hebert, and T. Kanade, "Geometric reasoning for single image structure recovery," in Proc. CVPR, Jun. 2009, pp. 2136–2143.
[3] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, "Manhattan-world stereo," in Proc. CVPR, Jun. 2009, pp. 1422–1429.
[4] A. Flint, D. Murray, and I. Reid, "Manhattan scene understanding using monocular, stereo, and 3D features," in Proc. ICCV, Nov. 2011, pp. 2228–2235.


[5] S. Ramalingam, J. K. Pillai, A. Jain, and Y. Taguchi, "Manhattan junction catalogue for spatial reasoning of indoor scenes," in Proc. CVPR, Jun. 2013, pp. 3065–3072.
[6] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, "Learning to parse wireframes in images of man-made environments," in Proc. CVPR, Jun. 2018, pp. 626–635.
[7] R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, "LSD: A fast line segment detector with a false detection control," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 722–732, Apr. 2010.
[8] G.-S. Xia, J. Delon, and Y. Gousseau, "Accurate junction detection and characterization in natural images," Int. J. Comput. Vis., vol. 106, no. 1, pp. 31–56, Jan. 2014.
[9] A. Elqursh and A. Elgammal, "Line-based relative pose estimation," in Proc. CVPR, Jun. 2011, pp. 3049–3056.
[10] S. Ramalingam and M. Brand, "Lifting 3D Manhattan lines from a single image," in Proc. ICCV, Dec. 2013, pp. 497–504.
[11] H. Yang and H. Zhang, "Efficient 3D room shape recovery from a single panorama," in Proc. CVPR, Jun. 2016, pp. 5422–5430.
[12] N. Xue, S. Bai, F. Wang, G.-S. Xia, T. Wu, and L. Zhang, "Learning attraction field representation for robust line segment detection," in Proc. CVPR, Jun. 2019, pp. 1595–1603.
[13] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 379–387.
[14] C. G. Harris and M. Stephens, "A combined corner and edge detector," in Proc. Alvey Vis. Conf., 1988, pp. 1–6.
[15] M. Maire, P. Arbelaez, C. Fowlkes, and J. Malik, "Using contours to detect and localize junctions in natural images," in Proc. CVPR, Jun. 2008, pp. 1–8.
[16] J. McDermott, "Psychophysics with junctions in real images," Perception, vol. 33, no. 9, pp. 1101–1127, 2004.
[17] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proc. CVPR, Jun. 2014, pp. 1653–1660.
[18] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. ECCV, 2016, pp. 483–499.
[19] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in Proc. CVPR, Jun. 2016, pp. 4724–4732.
[20] Y. Zhou, H. Qi, and Y. Ma, "End-to-end wireframe parsing," 2019, arXiv:1905.03246. [Online]. Available: https://arxiv.org/abs/1905.03246
[21] M. Brown, D. Windridge, and J.-Y. Guillemaut, "A generalisable framework for saliency-based line segment detection," Pattern Recognit., vol. 48, no. 12, pp. 3993–4011, Dec. 2015.
[22] M. Nieto, C. Cuevas, L. Salgado, and N. García, "Line segment detection using weighted mean shift procedures on a 2D slice sampling strategy," Pattern Anal. Appl., vol. 14, no. 2, pp. 149–163, 2011.
[23] X. Lu, J. Yao, K. Li, and L. Li, "CannyLines: A parameter-free line segment detector," in Proc. ICIP, Sep. 2015, pp. 507–511.
[24] X. Liu, Z. Cao, N. Gu, S. Nahavandi, C. Zhou, and M. Tan, "Intelligent line segment perception with cortex-like mechanisms," IEEE Trans. Syst., Man, Cybern., Syst., vol. 45, no. 12, pp. 1522–1534, Dec. 2015.
[25] J. Matas, C. Galambos, and J. Kittler, "Robust detection of lines using the progressive probabilistic Hough transform," Comput. Vis. Image Understand., vol. 78, no. 1, pp. 119–137, Apr. 2012.
[26] Y. Furukawa and Y. Shinagawa, "Accurate and robust line segment extraction by analyzing distribution around peaks in Hough space," Comput. Vis. Image Understand., vol. 92, no. 1, pp. 1–25, 2003.
[27] Z. Xu, B.-S. Shin, and R. Klette, "Accurate and robust line segment extraction using minimum entropy with Hough transform," IEEE Trans. Image Process., vol. 24, no. 3, pp. 813–822, Mar. 2015.
[28] P. Dollár and C. L. Zitnick, "Structured forests for fast edge detection," in Proc. ICCV, Dec. 2013, pp. 1841–1848.
[29] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proc. ICCV, Dec. 2015, pp. 1395–1403.
[30] K.-K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool, "Convolutional oriented boundaries," in Proc. ECCV, 2016, pp. 580–596.
[31] Y. Liu, M.-M. Cheng, X. Hu, J.-W. Bian, L. Zhang, X. Bai, and J. Tang, "Richer convolutional features for edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1939–1946, Aug. 2019.
[32] Z. Zhang, Z. Li, N. Bi, J. Zheng, J. Wang, K. Huang, W. Luo, Y. Xu, and S. Gao, "PPGNet: Learning point-pair graph for line segment detection," in Proc. CVPR, Jun. 2019, pp. 7105–7114.
[33] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," Int. J. Comput. Vis., vol. 76, no. 1, pp. 53–69, 2008.
[34] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. NIPS, 2014, pp. 2366–2374.
[35] D. F. Fouhey, A. Gupta, and M. Hebert, "Data-driven 3D primitives for single image understanding," in Proc. ICCV, Dec. 2013, pp. 3392–3399.
[36] D. F. Fouhey, A. Gupta, and M. Hebert, "Unfolding an indoor origami world," in Proc. ECCV, 2014, pp. 687–702.
[37] O. Haines and A. Calway, "Recognising planes in a single image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1849–1861, Sep. 2015.
[38] R. Guo, C. Zou, and D. Hoiem, "Predicting complete 3D models of indoor scenes," 2015, arXiv:1504.02437. [Online]. Available: https://arxiv.org/abs/1504.02437
[39] A. Mallya and S. Lazebnik, "Learning informative edge maps for indoor scene layout prediction," in Proc. ICCV, Dec. 2015, pp. 936–944.
[40] Y. Ren, C. Chen, S. Li, and C.-C. J. Kuo, "A coarse-to-fine indoor layout estimation (CFILE) method," 2016, arXiv:1607.00598. [Online]. Available: https://arxiv.org/abs/1607.00598
[41] S. Dasgupta, K. Fang, K. Chen, and S. Savarese, "DeLay: Robust spatial layout estimation for cluttered indoor scenes," in Proc. CVPR, Jun. 2016, pp. 616–624.
[42] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa, "PlaneNet: Piece-wise planar reconstruction from a single RGB image," in Proc. CVPR, Jun. 2018, pp. 2579–2588.
[43] F. Yang and Z. Zhou, "Recovering 3D planes from a single image via convolutional neural networks," in Proc. ECCV, 2018, pp. 87–103.
[44] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, and S. Yan, "Proposal-free network for instance-level object segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2978–2991, Dec. 2018.
[45] M. Bai and R. Urtasun, "Deep watershed transform for instance segmentation," in Proc. CVPR, Jul. 2017, pp. 5221–5229.
[46] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, "PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model," in Proc. ECCV, Sep. 2018, pp. 269–286.
[47] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. NIPS, 2015, pp. 91–99.
[48] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. ICCV, Oct. 2017, pp. 2980–2988.
[49] P. Denis, J. H. Elder, and F. J. Estrada, "Efficient edge-based methods for estimating Manhattan frames in urban imagery," in Proc. ECCV, 2008, pp. 197–210.
[50] E. J. Almazàn, R. Tal, Y. Qian, and J. H. Elder, "MCMLSD: A dynamic programming approach to line segment detection," in Proc. CVPR, Jul. 2017, pp. 5854–5862.
[51] D. R. Martin, C. C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 530–549, May 2004.

KUN HUANG received the B.Eng. degree from the University of Science and Technology of China, in 2013. He is currently pursuing the Ph.D. degree with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China.

SHENGHUA GAO received the B.Eng. degree from the University of Science and Technology of China, in 2008 (outstanding graduate), and the Ph.D. degree from Nanyang Technological University, in 2012. From June 2012 to July 2014, he was a Postdoctoral Fellow at the Advanced Digital Sciences Center, Singapore. He is currently an Assistant Professor with ShanghaiTech University, China. His research interests include computer vision and machine learning.
