
Leveraging Planar Regularities for Point Line Visual-Inertial Odometry

Xin Li∗1,2, Yijia He∗2, Jinlong Lin1, Xiao Liu2

Abstract— With a monocular Visual-Inertial Odometry (VIO) system, the 3D point cloud and camera motion can be estimated simultaneously. Because pure sparse 3D points provide a structureless representation of the environment, generating a 3D mesh from sparse points can further model the environment topology and produce dense mapping. To improve the accuracy of 3D mesh generation and localization, we propose a tightly-coupled monocular VIO system, PLP-VIO, which exploits point features and line features as well as plane regularities. The co-planarity constraints are used to leverage additional structure information for more accurate estimation of 3D points and spatial lines in the state estimator. To detect planes and generate the 3D mesh robustly, we combine line features with point features in the detection method. The effectiveness of the proposed method is verified on both synthetic data and public datasets and is compared with other state-of-the-art algorithms.

I. INTRODUCTION

Simultaneous motion estimation and dense mapping are important to many robotic applications like autonomous driving, augmented reality, and building inspection. Cameras and inertial measurement units (IMUs) are low-cost and effective sensors, based on which a Visual-Inertial Odometry (VIO) system can be implemented. Existing VIO methods [1]–[3] can estimate pose accurately but only provide a sparse map of 3D points. In [4], [5], dense mapping is realized based on monocular vision and pixel-wise reconstruction, which is time-consuming and relies heavily on a Graphics Processing Unit (GPU) to guarantee real-time performance.

For light-weight dense mapping, Lucas et al. [6] used sparse point features on the image to generate a 2D Delaunay triangulation and build a dense map for aerial inspection. A 3D Delaunay triangulation, generated from the sparse 3D points estimated by ORB-SLAM, is used to build a dense map in real time for robot navigation [7]. However, these algorithms decouple pose estimation and dense mapping, causing information loss. Antoni et al. [8], [9] proposed an incremental visual-inertial 3D mesh generation system that uses the 3D mesh to detect planes and enforces planar structural regularities to improve localization accuracy. However, the sparse point-based map is structureless, so the mesh does not fit the environment well.

For heterogeneous feature utilization, line and plane features have attracted researchers' attention. In our previous work [10], the environment reconstructed with 3D straight lines reflects more geometric structure information than point features.

* Equal contribution
1 Xin Li and Jinlong Lin are with the School of Software & Microelectronics, Peking University, Beijing, China, {[email protected], [email protected]}
2 Xin Li, Yijia He and Xiao Liu are with Megvii (Face++) Technology Inc., Beijing, China, {heyijia, [email protected]}

Fig. 1: The proposed monocular VIO system detects planes and builds a 3D mesh of the environment by combining 3D sparse points and structural lines. (a) 2D Delaunay triangulation of point features and line features. (b) Plane detection (blue) with the 2D triangulation result and the 3D structural lines (green). (c) 3D mesh of the environment and the related structural lines (green).

Moreover, line features incorporated into a SLAM system can improve pose accuracy and robustness in illumination-changing scenes [10], [11]. However, these works do not consider the geometric constraints between point and line features. Zou et al. [12] proposed a monocular line-based VIO system leveraging structural regularity from the Atlanta world model, but point features are not introduced into their StructVIO system. Moreover, none of these works used structural landmarks to build a dense map. Planar constraints used in SLAM systems more often rely on depth sensors [13]–[15]. Lu et al. [16] presented a multi-layer feature graph utilizing points, lines, planes, and vanishing points to improve the accuracy of bundle adjustment. However, they used RANSAC-based approaches to detect planes, which require numerous iterations and are time-consuming. Yang et al. [17] investigated the observability of VIO using heterogeneous features of points, lines, and planes. [18] proposed a unified representation for point, line, and plane features, but they still need depth sensors for plane detection.


In this paper, we propose a tightly-coupled monocular VIO system that involves heterogeneous visual features including points, lines, and planes, as well as their co-planarity constraints (PLP-VIO), to realize accurate motion estimation and dense mapping, as shown in Fig. 1. A related work [8] also utilized planar regularities but only used sparse point features. In comparison, our proposed method not only leverages the planar regularities but also involves both point features and line features to detect planes, so that richer structural information is used for 3D mesh generation. The main contributions of this work include:

• We propose a non-iterative plane detection method and a 3D mesh generation method based on sparse points and spatial lines for monocular VIO. Our proposed method utilizes structural lines to improve the correctness and robustness of plane detection and mesh generation compared to methods involving only sparse points.

• We develop a tightly-coupled monocular VIO system that utilizes heterogeneous visual features, including points, lines, and planes, as well as their co-planarity constraints. The richer visual information and the spatial constraints between landmarks improve the estimator's accuracy and robustness.

• We provide an ablation experiment to verify the effectiveness of heterogeneous features on both synthetic data and the EuRoc dataset [19]. Experimental results demonstrate that our system simultaneously improves the accuracy of pose estimation and mapping.

II. SYSTEM OVERVIEW

The proposed system is derived from our previous work PL-VIO [10], which incorporates line segments into VINS-Mono [2]. As shown in Fig. 2, our system contains two modules: the front end and the back end. In the front end, raw measurements from the IMU and camera are pre-processed. The related operations include IMU pre-integration and the detection and matching of point-line features, which are described in [10]. Here we mainly introduce the plane-related operations in the back end.

First, the 2D point and line features are triangulated to estimate the 3D point and line landmarks in the map. Second, these 3D landmarks are used to generate planes and the 3D mesh. Also, co-planar constraints are extracted from the planes to constrain the 3D landmarks. A detailed description of the plane reconstruction is presented in Sec. III. Third, the IMU body state and the 3D landmarks in the map are optimized with the sliding window optimization by minimizing the sum of the IMU residuals, the vision re-projection residuals, and the co-planar constraint residuals. All these residuals are described in Sec. IV. The marginalization strategy of keyframes in the sliding window can be found in [2]. Finally, invalid planes are culled: if the active point and line features belonging to a plane are fewer than the threshold number δπ = 30, the plane is culled from the map and the co-planar constraints with this plane are also removed; a sketch of this culling step follows.
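As a concrete illustration of the culling rule, here is a minimal Python sketch; the map structure and helper names are our own assumptions, not the authors' code:

```python
# Minimal sketch of the plane-culling step; the dictionary-based map layout
# is hypothetical, only the threshold delta_pi = 30 comes from the paper.
DELTA_PI = 30  # minimum number of active point/line features per plane

def cull_invalid_planes(planes, coplanar_constraints):
    """Drop planes supported by too few active features, plus their constraints."""
    kept = []
    for plane in planes:
        support = len(plane["point_ids"]) + len(plane["line_ids"])
        if support >= DELTA_PI:
            kept.append(plane)
        else:
            # Remove every co-planarity residual that references this plane.
            coplanar_constraints[:] = [c for c in coplanar_constraints
                                       if c["plane_id"] != plane["id"]]
    return kept
```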

Fig. 2: Overview of our PLP-VIO system. The Front-End module extracts information from the raw measurements (IMU pre-integration; FAST point feature detection and matching; LSD-LBD line feature detection and matching). The 3D points, lines, mesh map, and body states are estimated with sliding window optimization in the Back-End (point, line, and plane reconstruction; 3D mesh generation; states optimization and marginalization with IMU, projection, and on-plane residuals), followed by outlier culling of points, lines, and planes.

III. CO-PLANARITY DETECTION

Given a set of 3D points and spatial lines, the planes and their co-planarity are detected. Below we introduce the landmark representation, the 3D mesh generation from point-line features, and the plane detection with sparse landmarks, sequentially.

A. Landmarks Representation

Since 3D plane detection is based on the 3D landmark representation, here we briefly introduce the geometric meaning and parameterization of these 3D landmarks.

1) Point representation: We use the inverse depth $\lambda \in \mathbb{R}$ to parameterize a point landmark from the first keyframe in which the point is observed. Given its observation $\mathbf{z} = [u, v, 1]^\top$ in the normalized image plane $\mathcal{I}$ (located at focal length $= 1$), we recover its 3D position as $\mathbf{f} = \frac{1}{\lambda}\mathbf{z}$.

2) Line representation: Plücker coordinates $\mathbf{L}^c = [\mathbf{n}^{c\top}, \mathbf{d}^{c\top}]^\top$ are used for 3D line initialization and transformation. $\mathbf{d}^c \in \mathbb{R}^3$ is the line's direction vector in camera frame $c$, and $\mathbf{n}^c \in \mathbb{R}^3$ is the normal vector of the plane determined by this line and the origin of the camera frame. The Plücker coordinates are over-parameterized, since there is an implicit constraint between the vectors $\mathbf{n}$ and $\mathbf{d}$, i.e., $\mathbf{n}^\top\mathbf{d} = 0$. The orthonormal representation $\mathcal{O} = (\mathbf{U}, \mathbf{W}) \in SO(3) \times SO(2)$ is used during optimization to avoid over-parameterization. The geometric representation of 3D lines and the transformations between the parameterized forms are introduced in detail in [20].
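To make the two parameterizations concrete, the following sketch (our own illustration, not the authors' code) back-projects a normalized observation with inverse depth and builds Plücker coordinates from two points on a line:

```python
import numpy as np

def point_from_inverse_depth(z_norm, lam):
    """3D point in the host camera frame: f = (1/lambda) * z, with z = [u, v, 1]^T."""
    return np.asarray(z_norm, dtype=float) / lam

def plucker_from_two_points(p1, p2):
    """Plucker coordinates L = [n; d] of the line through p1 and p2.

    d is the direction vector; n = p1 x p2 is the normal of the plane
    spanned by the line and the camera origin, so n.dot(d) == 0 holds.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    n = np.cross(p1, p2)
    return n, d

# Example: a feature observed at (u, v) = (0.2, -0.1) with inverse depth 0.5
f = point_from_inverse_depth([0.2, -0.1, 1.0], 0.5)   # -> point at depth 2 m
n, d = plucker_from_two_points([0, 0, 2], [1, 0, 2])
assert abs(n @ d) < 1e-12  # implicit Plucker constraint n^T d = 0
```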

3) Plane representation: We denote a plane in the world frame $w$ by $\boldsymbol{\pi} = [\mathbf{n}^\top, d]^\top$, where $\mathbf{n} \in \mathbb{R}^3$ is the plane's normal vector and $d \in \mathbb{R}$ is the distance from the frame origin to the plane. The unit normal vector $\mathbf{n}$ has three parameters but only two degrees of freedom (DoF). We therefore optimize the plane normal vector $\mathbf{n}$ on its tangent space during optimization. Inspired by the gravity vector's parameterization in [2], we represent $\mathbf{n}$ as $\mathbf{n} + w_1\mathbf{b}_1 + w_2\mathbf{b}_2$, where $\mathbf{n}$ is a unit normal vector and $\mathbf{b}_1, \mathbf{b}_2$ are two orthogonal bases spanning the tangent plane, as shown in Fig. 3.

Fig. 3: Illustration of the parameterization of the 2-DoF normal vector with two orthogonal bases $\mathbf{b}_1, \mathbf{b}_2$, which span the tangent space.
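A minimal sketch of this 2-DoF update, assuming the tangent basis is built with a simple cross-product recipe and that the normal is renormalized after each increment (the paper does not specify how $\mathbf{b}_1, \mathbf{b}_2$ are chosen):

```python
import numpy as np

def tangent_basis(n):
    """Two orthonormal vectors b1, b2 spanning the tangent space of unit normal n."""
    n = n / np.linalg.norm(n)
    # Pick any vector not parallel to n to seed the basis.
    seed = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    b1 = np.cross(n, seed)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(n, b1)
    return b1, b2

def update_normal(n, w1, w2):
    """Apply the 2-DoF increment n <- normalize(n + w1*b1 + w2*b2)."""
    b1, b2 = tangent_basis(n)
    n_new = n + w1 * b1 + w2 * b2
    return n_new / np.linalg.norm(n_new)

n = np.array([0.0, 0.0, 1.0])
print(update_normal(n, 0.01, -0.02))  # small perturbation stays on the unit sphere
```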

B. 3D Mesh Generation

Since it is difficult to directly generate a 3D mesh from sparse 3D landmarks, we build the 3D mesh from a 2D Delaunay triangulation using both point and line features, as shown in Fig. 4. First, we detect point and line features in the latest keyframe. Then, we use the constrained Delaunay triangulation algorithm [21] to build a 2D mesh from these points and line segments. Compared to the basic Delaunay triangulation, the constrained Delaunay triangulation retains the observed line edges in the image frame. Thus, the structural information of line features is preserved.

However, there might be some invalid 3D triangular patches, since they are built from 2D observations. To remove these outliers, we filter the patches according to two conditions. First, if a 3D triangular patch is on a plane, there must be more than two adjacent patches whose normal vectors deviate from its own by less than a certain threshold. The other condition, used in [8], is that 3D triangular patches should not be particularly sharp triangles, such as triangles with an aspect ratio higher than 20 or an acute angle smaller than 5 deg; a sketch of both checks follows.
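A minimal sketch of the two filtering conditions in Python; we approximate the aspect ratio as the longest-to-shortest edge ratio, and the 10 deg normal-deviation threshold is our assumption (the 20:1 ratio and 5 deg angle come from the text):

```python
import numpy as np

def triangle_normal(a, b, c):
    """Unit normal of the triangle (a, b, c)."""
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

def is_sharp(a, b, c, max_aspect=20.0, min_angle_deg=5.0):
    """Reject degenerate patches: high aspect ratio or a very acute corner."""
    edges = [np.linalg.norm(b - a), np.linalg.norm(c - b), np.linalg.norm(a - c)]
    if max(edges) / min(edges) > max_aspect:
        return True
    for p, q, r in [(a, b, c), (b, c, a), (c, a, b)]:
        u, v = q - p, r - p
        cos_t = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        if np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))) < min_angle_deg:
            return True
    return False

def has_coplanar_neighbors(normal, neighbor_normals, max_angle_deg=10.0):
    """A patch on a plane should have > 2 adjacent patches with similar normals."""
    close = sum(
        np.degrees(np.arccos(np.clip(abs(normal @ m), -1.0, 1.0))) < max_angle_deg
        for m in neighbor_normals)
    return close > 2
```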

We maintain and update the 3D mesh map over the sliding window. When 3D points and lines are marginalized, the associated mesh is fixed. Considering the instability of straight-line endpoint detection, we do not directly propagate the mesh between frames; instead, we detect the mesh on every new frame and merge it into the local map. Mesh fusion is conducted by comparing the current 3D triangular patches to the previous 3D mesh in the sliding window and removing duplicated patches. If the state estimator does not provide any 3D points or lines, which might happen during fast rotation, the 3D mesh will not be generated.

C. Plane Detection

After getting the 3D mesh of the scene, we can detect planes from these triangular patches and spatial lines, as shown in Fig. 4. Utilizing both triangular patches and all spatial lines to detect planes is more suitable not only for point-rich but also for textureless scenes, compared to utilizing only triangular patches or only lines.

Fig. 4: 3D mesh generation and plane detection. The 2D Delaunay triangulation (green), built with point (black dot) and line (red) features in the image plane $\mathcal{I}$, is used to build the 3D mesh (blue patches). Plane $\pi_j$ is detected with the 3D mesh and spatial lines. Plane $\pi_i$ is detected mainly using spatial lines when point features are few.

Before finding the planes, we collect useful information from the triangular patches and spatial lines. For each triangular patch, we compute its normal direction $\mathbf{n}_\pi$. For spatial lines, the Plücker coordinates treat a 3D line as an infinite straight line, so we select two 3D points on the line to help detect planes. Since the observation of a straight line in an image is a line segment, the endpoints of the line segment can be used to calculate the two 3D points on the spatial line [20]. Moreover, since the reconstructed 3D lines are noisy, we need to filter or merge some of them: if two lines have similar directions $\mathbf{d}$ and are close enough in 3D space, we regard them as the same straight line in the process of plane detection.

In order to extract planes in a non-iterative manner, we divide the planes in space into horizontal planes, vertical planes, and other planes. The horizontal and vertical planes can be identified using the direction of gravity obtained from IMU measurements. For horizontal plane detection, we collect all triangular patches and spatial lines that are perpendicular to the gravity direction. Then, we build a 1D histogram of the heights of the triangular patch vertices and the heights of the 3D points on lines. Moreover, considering the structural information of lines, we give higher weights to 3D points on the lines than to other points in the statistics: the weight of a line equals the total weight of four points. After accumulation, a Gaussian filter is used to eliminate multiple local maxima, as in [8]. Finally, we extract the local maxima of the histogram which exceed the threshold (σt = 20) and consider them valid planes; the sketch below illustrates this procedure.
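A minimal sketch of the weighted 1D height histogram, under our own assumptions about bin size and smoothing; each line contributes two sampled 3D points, so a per-point weight of 2 makes one line count as four ordinary points, which is our reading of the weighting rule:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def detect_horizontal_planes(patch_vertex_heights, line_point_heights,
                             bin_size=0.05, sigma_t=20.0, smooth_sigma=1.0):
    """Weighted 1D height histogram; peaks above sigma_t become plane candidates."""
    patch_h = np.asarray(patch_vertex_heights, float)
    line_h = np.asarray(line_point_heights, float)
    heights = np.concatenate([patch_h, line_h])
    # Each line is sampled as two 3D points; weight 2 per point makes one
    # line count as four ordinary points (our reading of the weighting rule).
    weights = np.concatenate([np.ones_like(patch_h), 2.0 * np.ones_like(line_h)])
    nbins = max(3, int(np.ceil((heights.max() - heights.min()) / bin_size)))
    hist, edges = np.histogram(heights, bins=nbins, weights=weights)
    hist = gaussian_filter1d(hist.astype(float), smooth_sigma)  # merge nearby maxima
    return [0.5 * (edges[i] + edges[i + 1])          # plane height z = const
            for i in range(1, nbins - 1)
            if hist[i] > sigma_t
            and hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1]]
```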

As to vertical plane detection, we collect the patches and lines that are approximately perpendicular to a horizontal normal. A 2D histogram is built: one axis is the azimuth of the patches' normal vectors or the lines' direction vectors in the world frame, and the other axis is the projection distance of the vector pointing from the world frame origin to a 3D point onto the normal vector perpendicular to that azimuth. Then, we select the candidate planes in a way similar to horizontal plane detection.

As for other planes, the above non-iterative method cannot be adopted, since the angle of the normal vector has 2 DoF. Considering the efficiency of plane detection, these planes are not detected in our system. Fortunately, most patches and lines have already been classified after detecting the horizontal and vertical planes in a man-made structural environment.

After plane detection, we check whether a detected plane is already in the estimator by comparing their parameters $\boldsymbol{\pi}$; if two planes are close enough, the duplicate is removed to avoid redundant plane variables. To generate planar regularities that improve the VIO accuracy, we assign the 3D points and spatial lines to their corresponding plane when the distance from the landmark to the plane is lower than a threshold. After the sliding window optimization, if the distance from a point or from the line endpoints to the plane exceeds 3 cm, the point or line is no longer constrained by the planar regularity, and the meshes which use these points or lines as vertices are removed; a sketch of the assignment step follows.
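A minimal sketch of the landmark-to-plane assignment, with the 3 cm threshold from the text; the data layout and helper names are our own assumptions:

```python
import numpy as np

ASSIGN_THRESH = 0.03   # assign a landmark to a plane if closer than 3 cm
                       # (also the post-optimization detachment threshold)

def point_plane_distance(p, n, d):
    """Signed distance from 3D point p to plane (n, d) with unit normal n."""
    return float(n @ p - d)

def assign_landmarks(points, lines, plane):
    """Collect indices of the points and line segments lying on the plane."""
    n, d = plane["n"], plane["d"]
    on_plane_pts = [i for i, p in enumerate(points)
                    if abs(point_plane_distance(p, n, d)) < ASSIGN_THRESH]
    on_plane_lines = [j for j, (s, e) in enumerate(lines)
                      if abs(point_plane_distance(s, n, d)) < ASSIGN_THRESH
                      and abs(point_plane_distance(e, n, d)) < ASSIGN_THRESH]
    return on_plane_pts, on_plane_lines
```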

IV. PL-VIO WITH PLANAR CONSTRAINT

In this section, we build our VIO system with a sliding window optimization which estimates body states and 3D landmarks by fusing IMU and visual information.

A. Formulation

The full state vector of the proposed VIO system within a sliding window contains the IMU states, point features' inverse depths, 3D line landmarks, and plane states. The states at time $t$ are defined as:

$$\mathcal{X}_t \doteq \{\mathbf{x}_i, \lambda_k, \mathcal{O}^w_l, \boldsymbol{\pi}^w_h\}_{i \in B_t,\, k \in F_t,\, l \in L_t,\, h \in \Pi_t}$$
$$\mathbf{x}_i \doteq \left[\mathbf{p}^w_{b_i}, \mathbf{q}^w_{b_i}, \mathbf{v}^w_{b_i}, \mathbf{b}^{b_i}_a, \mathbf{b}^{b_i}_g\right] \tag{1}$$

where the set $B_t$ contains the active IMU body states in the sliding window at time $t$. The sets $F_t$, $L_t$ and $\Pi_t$ contain the point, line and plane states active in the sliding window at time $t$, respectively. $\mathbf{x}_i$ is composed of the $i$th IMU body position $\mathbf{p}^w_{b_i} \in \mathbb{R}^3$, orientation $\mathbf{q}^w_{b_i}$, and velocity $\mathbf{v}^w_{b_i} \in \mathbb{R}^3$ with respect to the world frame $w$. $\mathbf{b}^{b_i}_a, \mathbf{b}^{b_i}_g \in \mathbb{R}^3$ are the accelerometer bias and gyroscope bias in body frame $b_i$, respectively. We use quaternions to represent orientation within the state vector and rotation matrices $\mathbf{R} \in SO(3)$ to rotate vectors. For landmarks, we use the inverse depth $\lambda_k$ to parameterize the $k$th point landmark from the first keyframe in which the landmark is observed. $\mathcal{O}^w_l$ and $\boldsymbol{\pi}^w_h$ are the orthonormal representation of the $l$th line feature and the $h$th plane in the world frame $w$, respectively.
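For illustration only, the sliding-window state of Eq. (1) could be held in a container like the following; the field names are our own shorthand, not the authors' implementation:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class BodyState:
    p_wb: np.ndarray   # position p^w_b in the world frame, R^3
    q_wb: np.ndarray   # orientation quaternion q^w_b (4-vector)
    v_wb: np.ndarray   # velocity v^w_b, R^3
    ba: np.ndarray     # accelerometer bias b_a, R^3
    bg: np.ndarray     # gyroscope bias b_g, R^3

@dataclass
class SlidingWindowState:
    bodies: dict = field(default_factory=dict)      # i -> BodyState, i in B_t
    inv_depths: dict = field(default_factory=dict)  # k -> lambda_k, k in F_t
    lines: dict = field(default_factory=dict)       # l -> orthonormal (U, W), l in L_t
    planes: dict = field(default_factory=dict)      # h -> (n, d), h in Pi_t
```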

We optimize all the state variables in the sliding window by minimizing the sum of cost terms from all the measurement residuals:

$$\begin{aligned}
\mathcal{X} = \arg\min_{\mathcal{X}} \; & \|\mathbf{r}_p - \mathbf{H}_p\mathcal{X}\|^2 + \sum_{i\in B}\|\mathbf{r}_b\|^2_{\Sigma_{b_i b_{i+1}}} \\
& + \sum_{i\in B}\Big(\sum_{k\in F}\rho\big(\|\mathbf{r}^{c_i}_{f_k}\|^2_{\Sigma_F}\big) + \sum_{l\in L}\rho\big(\|\mathbf{r}^{c_i}_{L_l}\|^2_{\Sigma_L}\big)\Big) \\
& + \sum_{h\in\Pi}\Big(\sum_{k\in F}\rho\big(\|\mathbf{r}^{\pi_h}_{f_k}\|^2_{\Sigma_{\pi F}}\big) + \sum_{l\in L}\rho\big(\|\mathbf{r}^{\pi_h}_{L_l}\|^2_{\Sigma_{\pi L}}\big)\Big)
\end{aligned}$$

Fig. 5: Illustration of the visual residuals (red dashed lines): the point re-projection residual $\mathbf{r}^c_f$, the line re-projection residual $\mathbf{r}^c_L$, the distance from a 3D point to a plane $\mathbf{r}^\pi_f$, and the distance from a 3D line to a plane $\mathbf{r}^\pi_L$.

where $\mathbf{r}_b$ is the IMU measurement residual; $\mathbf{r}^{c_i}_{f_k}$ and $\mathbf{r}^{c_i}_{L_l}$ are the re-projection residuals of points and lines, respectively; and $\mathbf{r}^{\pi_h}_{f_k}$ and $\mathbf{r}^{\pi_h}_{L_l}$ are the point-on-plane and line-on-plane residuals, respectively. An illustration of these residuals is shown in Fig. 5. $\mathbf{r}_p$ and $\mathbf{H}_p$ are the prior residual and Jacobian from the marginalization operator, respectively [2]. Since the IMU pre-integration residual and the sliding window marginalization are not our main contribution, their details can be found in [2]. $\rho(\cdot)$ is the Cauchy robust function used to suppress outliers. $\Sigma_{(\cdot)}$ is the covariance matrix of a measurement. The covariance matrices of the measurements are constant in our experiments, except $\Sigma_{b_ib_{i+1}}$, which is calculated by covariance propagation with the IMU measurement noise. The setting of each covariance is introduced in the subsequent sections. The Ceres solver [22] is used to solve this nonlinear problem.
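As a toy illustration of how a robust kernel enters the objective, the sketch below evaluates one Mahalanobis-weighted residual through a Cauchy loss; the kernel scale and the 460 px focal length used to map 1-pixel noise onto the normalized image plane are assumptions (the system itself uses Ceres' C++ robust losses):

```python
import numpy as np

def cauchy_rho(s, scale=1.0):
    """Cauchy robust function rho(s) = scale^2 * log(1 + s / scale^2)."""
    return scale**2 * np.log1p(s / scale**2)

def robust_cost(residual, sigma):
    """rho(||r||^2_Sigma): squared Mahalanobis norm passed through the kernel."""
    info = np.linalg.inv(sigma)                # information matrix Sigma^-1
    s = float(residual @ info @ residual)      # ||r||^2_Sigma
    return cauchy_rho(s)

r = np.array([0.02, -0.01])                    # a 2D re-projection residual
sigma = (1.0 / 460.0)**2 * np.eye(2)           # ~1 px noise on the normalized plane
print(robust_cost(r, sigma))                   # large residuals grow only log-fast
```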

B. Point Feature Measurement Model

For point features, the re-projection error of a 3D point is the distance between the projected point and the observed point in the normalized image plane. Given the $k$th point feature measurement at frame $c_j$, $\mathbf{z}^{c_j}_{f_k} = [u^{c_j}_{f_k}, v^{c_j}_{f_k}, 1]^\top$, the re-projection error is defined as:

$$\mathbf{r}^{c_i}_{f_k} = \begin{bmatrix} \dfrac{x^{c_j}}{z^{c_j}} - u^{c_j}_{f_k} \\[2pt] \dfrac{y^{c_j}}{z^{c_j}} - v^{c_j}_{f_k} \end{bmatrix}$$

$$\mathbf{f}^{c_j}_k = \begin{bmatrix} x^{c_j} \\ y^{c_j} \\ z^{c_j} \end{bmatrix} = \mathbf{R}^\top_{bc}\left(\mathbf{R}^\top_{wb_j}\left(\mathbf{R}_{wb_i}\left(\mathbf{R}_{bc}\frac{1}{\lambda_k}\begin{bmatrix} u^{c_i}_{f_k} \\ v^{c_i}_{f_k} \\ 1 \end{bmatrix} + \mathbf{p}_{bc}\right) + \mathbf{p}_{wb_i} - \mathbf{p}_{wb_j}\right) - \mathbf{p}_{bc}\right) \tag{2}$$

where $\mathbf{z}^{c_i}_{f_k} = [u^{c_i}_{f_k}, v^{c_i}_{f_k}, 1]^\top$ is the first observation of the feature in camera frame $c_i$. $\mathbf{R}_{bc}$ and $\mathbf{p}_{bc}$ are the extrinsic parameters between the camera frame and the IMU body frame. The setting of the covariance matrix $\Sigma_F$ is based on the assumption that the feature point coordinates are affected by isotropic white Gaussian noise, whose standard deviation is set to 1 pixel.

C. Line Feature Measurement Model

Usually, the re-projection error of a line segment is defined as the distance from its endpoints to the projected line in the normalized image plane. However, the observations of the same spatial line in different frames have different lengths. Therefore, we normalize the distance by the line segment length to form our line re-projection residual.

Given a 3D line with its Plücker coordinates in camera frame $\mathbf{L}^c_l = [\mathbf{n}^c_l, \mathbf{d}^c_l]$, the measurement of the line segment $\mathbf{z}^c_{L_l}$ in the normalized image plane consists of two endpoints $\mathbf{s}^c_l = [u_s, v_s, 1]^\top$ and $\mathbf{e}^c_l = [u_e, v_e, 1]^\top$. The line re-projection residual is defined as:

$$\mathbf{r}^c_{L_l} = \frac{1}{\|\mathbf{z}^c_{L_l}\|}\begin{bmatrix} d(\mathbf{s}^c_l, \mathbf{n}^c_l) \\ d(\mathbf{e}^c_l, \mathbf{n}^c_l) \end{bmatrix} \tag{3}$$

with $d(\mathbf{s}, \mathbf{n})$ denoting the distance from endpoint $\mathbf{s}$ to the projected line $\mathbf{l}$:

$$d(\mathbf{s}, \mathbf{n}) = \frac{\mathbf{s}^\top\mathbf{n}}{\sqrt{n_1^2 + n_2^2}} \tag{4}$$

For the setting of the covariance $\Sigma_L$, we assume that the endpoints of the line segment are also affected by white Gaussian noise, as with point features.
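A minimal sketch of Eqs. (3)–(4), assuming that the projection of a Plücker line onto the normalized image plane is given by its normal component $\mathbf{n}^c$, and taking $\|\mathbf{z}\|$ to be the observed 2D segment length (our reading of the normalization):

```python
import numpy as np

def endpoint_line_distance(s, n):
    """Eq. (4): distance from endpoint s = [u, v, 1] to the projected line.

    For a normalized camera, the projected line of a Plucker line [n; d]
    is l = n (up to scale), so the point-line distance uses n1, n2 only.
    """
    return (s @ n) / np.hypot(n[0], n[1])

def line_reproj_residual(s, e, n_c):
    """Eq. (3): endpoint distances normalized by the observed segment length."""
    s, e, n_c = np.asarray(s, float), np.asarray(e, float), np.asarray(n_c, float)
    seg_len = np.linalg.norm(e[:2] - s[:2])
    return np.array([endpoint_line_distance(s, n_c),
                     endpoint_line_distance(e, n_c)]) / seg_len
```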

D. Plane Feature Measurement Model

The co-planarity constraints of 3D landmarks are defined by the distances from the 3D landmarks to the 3D plane.

1) Point-on-plane residual: Given a 3D point $\mathbf{f}_k$ on the plane $\boldsymbol{\pi}_h$, the residual is defined as:

$$\mathbf{r}^{\pi_h}_{f_k} = \mathbf{n}_\pi \cdot \mathbf{f}_k - d_\pi \tag{5}$$

2) Line-on-plane residual: A line on the plane provides information from two aspects [18]:
• every point on this spatial line must be on the plane;
• the direction of this line and the plane normal vector are orthogonal.
Hence, given a 3D line in Plücker coordinates $\mathbf{L}_l = [\mathbf{n}_l, \mathbf{d}_l]^\top$, the residual is given by:

$${}^1\mathbf{r}^{\pi_h}_{L_l} = \mathbf{n}_\pi \cdot \mathbf{Q} - d_\pi, \qquad {}^2\mathbf{r}^{\pi_h}_{L_l} = \mathbf{n}_\pi \cdot \mathbf{d}_l \tag{6}$$

where $\mathbf{Q} = \frac{\mathbf{n}_l \times \mathbf{d}_l}{\|\mathbf{d}_l\|}$ corresponds to a point on the 3D line.

According to the plane detection in Sec. III-C, the distance from a point to its plane will not exceed 3 cm, so we model the distance residual with white Gaussian noise with a standard deviation of 1 cm. As for the angle uncertainty of ${}^2\mathbf{r}^{\pi_h}_{L_l}$, the standard deviation is set to 1 deg.
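A minimal sketch of Eqs. (5)–(6), with $\mathbf{Q}$ computed exactly as in the text:

```python
import numpy as np

def point_on_plane_residual(f, n_pi, d_pi):
    """Eq. (5): signed distance from point f to the plane (n_pi, d_pi)."""
    return n_pi @ f - d_pi

def line_on_plane_residuals(n_l, d_l, n_pi, d_pi):
    """Eq. (6): distance of a line point Q to the plane, plus the
    direction/normal orthogonality term n_pi . d_l."""
    Q = np.cross(n_l, d_l) / np.linalg.norm(d_l)   # a point on the Plucker line
    return np.array([n_pi @ Q - d_pi, n_pi @ d_l])
```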

V. EXPERIMENTAL RESULTS

To evaluate the proposed approach, we conduct experiments on synthetic data and the EuRoc dataset. The trajectory accuracy, mapping quality, and runtime are analyzed. In these experiments, the algorithm ran on a computer with an Intel Core i7-8550U @ 1.8 GHz, 16 GB memory, and ROS Kinetic [23].

To assess the advantages of our proposed approach, we compared our method with OKVIS in monocular mode [3], VINS-Mono [2] without loop closure, PL-VIO [10], and PP-VIO [8]. For simplicity of comparison, we denote VINS-Mono as P, which only uses point features; the method in [8] as PP, which uses point features and co-planarity constraints; and PL-VIO as PL, which uses both point and line features. Our method is denoted as PLP, which uses point, line, and co-planarity constraints. For the effectiveness of the ablation study on heterogeneous features, we reimplemented PP on top of VINS-Mono, so that P, PP, PL, and PLP are all extended from the VINS-Mono code, and the same parameter configuration is used. We chose the absolute pose error (APE) as the main evaluation metric, which directly compares the trajectory error between the estimate and the ground truth [24]. We also use the relative pose error (RPE) for further qualitative analysis. The TUM evaluation toolkit [24] was used to calculate APE and RPE in our experiments.

A. Synthetic Data

To verify the validity of the VIO estimator with planar constraints, we generated a synthetic environment in which line and point features are distributed on the walls of a square room, as shown in Fig. 6. The virtual camera moved in the room, collecting images of 640 × 480 pixels. White Gaussian noise was added to the points and the endpoints of lines in the captured images.

Table I summarizes the results of the methods P, PP, PL, and the proposed PLP on the synthetic data. The root mean squared error (RMSE) of the APE was used to evaluate accuracy. The map error is the root mean squared error of the distance between the true 3D position and the estimated position. The results show that co-planarity constraints improve the trajectory accuracy and that our proposed PLP outperforms the comparison methods. For mapping quality, the largest map error, from PL, is caused by the inaccurate triangulation of the straight-line endpoints; the plane constraint effectively improves the accuracy of the straight lines and the mapping results.

Fig. 6: A synthetic scene with line features (blue), point features (red), and the camera trajectory (green).

TABLE I: Results for pipelines P, PP, PL, and our proposed approach PLP on the synthetic data. Best result in bold.

                  P      PP     PL     PLP
RMSE APE (cm)     8.94   8.06   7.97   7.34
Map error (cm)    3.21   1.57   3.44   1.32


B. EuRoc Dataset

The EuRoc micro aerial vehicle (MAV) datasets were collected by a MAV in two indoor scenes; they contain stereo images from a global-shutter camera at 20 FPS and synchronized IMU measurements at 200 Hz. The two scenes contain walls, floors, and some other planar objects. Each dataset has a ground-truth trajectory and a ground-truth point cloud, which can be used to evaluate localization accuracy and 3D mesh map quality. In our experiments, the extrinsic and intrinsic parameters are those specified by the datasets. Besides, we only used the images of the left camera.

1) Localization Accuracy: We evaluated the RMSE of the APE of the pipelines OKVIS, P, PP, PL, and PLP on the EuRoc dataset. Tab. II shows the results of the different methods. The OKVIS results are from [10]. For the experiments with P, PP, PL, and PLP, we use the default parameters from [2], except for the number of features: to improve the effectiveness of mesh detection, each new frame collects 200 point features and all line features longer than 50 pixels.

From Tab. II, PLP improves on PL across all sequences, but PP is not consistently better than P. This shows that using only sparse point features to detect planes can sometimes introduce incorrect structural information, and that introducing straight lines improves the accuracy of plane detection. In addition, our approach using point, line, and plane features together achieves the smallest translation error on all sequences except V1_02. The V1 sequences were collected in a square room with rich textures on the walls and floor. Further analysis of the results on V1_01 and V1_02 shows that the texture is so rich that P, with only point features, already achieves high accuracy, and mismatched straight-line features in PL reduce the accuracy. However, since the camera moves much faster in V1_03 than in V1_01 and V1_02, the blur and illumination changes caused by high-speed motion decrease the accuracy of P, while straight-line feature matching and detection remain more stable, so the accuracy of PL is higher than that of P. The structural regularities of planes are valid for both P and PL on all the V1 sequences.

To get more insight into the effectiveness of our system, the RPE is used to investigate the local accuracy of a trajectory. Fig. 7 shows the results on MH_05. PLP presents a lower RPE translation error than the other methods, especially within the time range 57∼67 s. An image recorded within this time range is shown in Fig. 8, which shows a scene with few point features and many structural lines on a plane. Compared with point features, the use of line features or planar structural regularities can effectively improve the positioning accuracy when point features cannot be used effectively for localization. In addition, the accuracy of PP is sometimes lower than that of P, such as around t = 20 s and t = 90∼100 s. In comparison, straight lines make the plane detection more accurate, and the PLP error is smaller than that of PL. This verifies the validity of straight lines in retaining spatial structure information.

TABLE II: The RMSE of the state-of-the-art methods compared to our PLP method on the EuRoc dataset. The translation (cm) and rotation (deg) errors are listed as follows. Best result in bold.

Seq.     OKVIS         P             PP            PL            PLP
         trans. rot.   trans. rot.   trans. rot.   trans. rot.   trans. rot.
MH_02    36.0   3.5    15.3   2.1    14.4   2.0    11.7   1.8    10.9   1.7
MH_03    21.5   1.5    18.5   1.3    18.6   1.3    15.4   1.1    14.9   1.1
MH_04    24.0   1.1    24.7   1.7    24.5   1.7    21.6   1.5    20.6   1.4
MH_05    39.6   1.1    22.2   1.3    22.3   1.3    18.3   1.2    16.6   1.2
V1_01    8.6    5.5    6.1    5.2    5.7    5.2    6.4    5.7    5.4    5.2
V1_02    12.2   2.3    7.5    1.8    7.4    1.8    7.6    1.8    7.5    1.8
V1_03    19.6   3.8    12.8   4.4    12.3   4.0    11.4   3.8    11.0   3.3
V2_02    18.2   2.7    15.7   3.2    13.7   2.6    15.0   2.4    12.5   2.4
V2_03    30.5   4.3    20.0   3.4    16.8   3.4    16.6   3.2    15.4   2.4

Fig. 7: Translation error (m) of the RPE over time (s) on dataset MH_05 with the approaches P, PP, PL, and PLP (proposed).

Fig. 8: An example of a scene with rich line features and few point features on a plane. Detected points and lines are drawn in red and blue, respectively.


2) Mapping Quality: To evaluate the methods based on the mapping quality metric defined in [25], we converted the generated 3D mesh into a dense point cloud by sampling the mesh with a uniform density of $10^3$ points/m². The resulting point cloud is aligned with the ground-truth point cloud using the Iterative Closest Point (ICP) algorithm. For each point, the mapping error is its distance to the nearest ground-truth point. To further verify that the geometric structure of straight lines is effective for plane detection and dense reconstruction, and that the plane structure can improve the accuracy of the map, we compared PLP with P. For the method P, we use the sparse point features to construct the mesh for dense reconstruction, but do not use structural constraints in the optimization. The results given by P and PLP on dataset V1_01 are shown in Fig. 9. PLP gives a more complete dense map than P because the straight lines provide better structural information during the mesh reconstruction process, especially for objects with textureless surfaces, such as the large box marked by the red line in Fig. 9. Moreover, PLP has a lower mean distance error (2.8 cm) than P (3.6 cm), benefiting from the planar regularities.
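A sketch of this evaluation pipeline under our own assumptions: barycentric sampling approximates the uniform $10^3$ points/m² density, a k-d tree provides the nearest-ground-truth-point distances, and the ICP alignment step itself is omitted:

```python
import numpy as np
from scipy.spatial import cKDTree

def sample_mesh_uniform(vertices, faces, density=1000.0, rng=None):
    """Sample a triangle mesh at ~density points/m^2 via barycentric sampling."""
    rng = rng or np.random.default_rng(0)
    pts = []
    for i0, i1, i2 in faces:
        a, b, c = vertices[i0], vertices[i1], vertices[i2]
        area = 0.5 * np.linalg.norm(np.cross(b - a, c - a))
        for _ in range(max(1, int(area * density))):
            u, v = rng.random(2)
            if u + v > 1.0:                      # fold back into the triangle
                u, v = 1.0 - u, 1.0 - v
            pts.append(a + u * (b - a) + v * (c - a))
    return np.asarray(pts)

def mapping_error(estimated_cloud, gt_cloud):
    """Per-point mapping error: distance from each estimated point to its
    nearest ground-truth point (clouds assumed already ICP-aligned)."""
    tree = cKDTree(gt_cloud)
    dists, _ = tree.query(estimated_cloud)
    return dists
```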

3) Runtime Evaluation: Because lines and planes are added to the sliding window optimization, both the state vector size and the problem complexity increase. Here, we mainly compare the average execution time of the back-end module with that of the other methods, evaluated on the V1_02 sequence. From Tab. III, we can observe that our non-iterative method detects planes and creates the mesh efficiently, with a runtime under 2 ms. The time cost of optimization and marginalization increases by only a few milliseconds. The reason is that the Schur complement trick is used to accelerate the solver; besides, the plane number (≈ 5) and line number (≈ 30) are small compared with the point number (≈ 200) in a sliding window.

TABLE III: Mean execution time (unit: millisecond) of the different approaches on the sequence V1_02.

Module             P      PP     PL     PLP
Plane detection    0      0.39   0      0.44
Mesh creation      0      0.65   0      1.42
Optimization       31.61  33.63  30.59  35.88
Marginalization    6.05   6.27   7.29   8.34

VI. CONCLUSIONS

In this paper, we presented a novel VIO system combining point, line, and plane features, called PLP-VIO. Both points and structural lines are used to detect planes and build the 3D mesh, which is superior to detection with only sparse points. The integration of planar regularities and heterogeneous visual features significantly improves localization accuracy and dense mapping accuracy simultaneously.

In the future, we plan to embed a loop-closure module in our VIO system to build a globally consistent dense map for robot navigation applications.

Fig. 9: Mapping quality comparison. The two colored point clouds are generated by P (middle) and PLP (bottom) running on V1_01; colors encode the corresponding mapping errors (point cloud error, 0.00–0.30 m). The large box with textureless surfaces (red rectangle, top) can be reconstructed by PLP.

REFERENCES

[1] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3565–3572.

[2] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.

[3] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual–inertial odometry using nonlinear optimization," The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.

[4] M. Pizzoli, C. Forster, and D. Scaramuzza, "REMODE: Probabilistic, monocular dense reconstruction in real time," in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 2609–2616.

[5] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2320–2327.

[6] L. Teixeira and M. Chli, "Real-time mesh-based scene estimation for aerial inspection," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4863–4869.

[7] Y. Ling and S. Shen, "Building maps for autonomous navigation using sparse visual SLAM features," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1374–1381.

[8] A. Rosinol, T. Sattler, M. Pollefeys, and L. Carlone, "Incremental visual-inertial 3D mesh generation with structural regularities," arXiv preprint arXiv:1903.01067, 2019.

[9] A. Rosinol, "Densifying sparse VIO: a mesh-based approach using structural regularities," Master's thesis, ETH Zurich, Zurich; Cambridge, MA, 2018.

[10] Y. He, J. Zhao, Y. Guo, W. He, and K. Yuan, "PL-VIO: Tightly-coupled monocular visual–inertial odometry using point and line features," Sensors, vol. 18, no. 4, p. 1159, 2018.

[11] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, "PL-SLAM: Real-time monocular visual SLAM with points and lines," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 4503–4508.

[12] D. Zou, Y. Wu, L. Pei, H. Ling, and W. Yu, "StructVIO: Visual-inertial odometry with structural regularity of man-made environments," IEEE Transactions on Robotics, vol. 35, no. 4, pp. 999–1013, 2019.

[13] E. Ataer-Cansizoglu, Y. Taguchi, S. Ramalingam, and T. Garaas, "Tracking an RGB-D camera using points and planes," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 51–58.

[14] M. Kaess, "Simultaneous localization and mapping with infinite planes," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4605–4611.

[15] Y. Yang, P. Geneva, X. Zuo, K. Eckenhoff, Y. Liu, and G. Huang, "Tightly-coupled aided inertial navigation with point and plane features," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6094–6100.

[16] Y. Lu and D. Song, "Visual navigation using heterogeneous landmarks and unsupervised geometric constraints," IEEE Transactions on Robotics, vol. 31, no. 3, pp. 736–749, 2015.

[17] Y. Yang and G. Huang, "Observability analysis of aided INS with heterogeneous features of points, lines and planes," arXiv preprint arXiv:1805.05876, 2018.

[18] F. Nardi, B. Della Corte, and G. Grisetti, "Unified representation and registration of heterogeneous sets of geometric primitives," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 625–632, 2019.

[19] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.

[20] G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, "Building a 3-D line-based map using stereo SLAM," IEEE Transactions on Robotics, vol. 31, no. 6, pp. 1364–1377, 2015.

[21] L. P. Chew, "Constrained Delaunay triangulations," Algorithmica, vol. 4, no. 1-4, pp. 97–108, 1989.

[22] S. Agarwal, K. Mierle, and others, "Ceres solver," http://ceres-solver.org.

[23] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: An open-source Robot Operating System," in ICRA Workshop on Open Source Software, Kobe, Japan, 2009, p. 5.

[24] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.

[25] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A multi-view stereo benchmark with high-resolution images and multi-camera videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3260–3269.