
Aggressive Quadrotor Flight Using Dense Visual-Inertial Fusion

Yonggen Ling, Tianbo Liu, and Shaojie Shen

Abstract— In this work, we address the problem of aggressive flight of a quadrotor aerial vehicle using cameras and IMUs as the only sensing modalities. We present a fully integrated quadrotor system and demonstrate through online experiment the capability of autonomous flight with linear velocities up to 4.2 m/s, linear accelerations up to 9.6 m/s², and angular velocities up to 245.1 degree/s. Central to our approach is a dense visual-inertial state estimator for reliable tracking of aggressive motions. An uncertainty-aware direct dense visual tracking module provides camera pose tracking that takes inverse depth uncertainty into account and is resistant to motion blur. Measurements from IMU pre-integration and multi-constrained dense visual tracking are fused probabilistically using an optimization-based sensor fusion framework. Extensive statistical analysis and comparison are presented to verify the performance of the proposed approach. We also release our code as open-source ROS packages.

I. INTRODUCTION

Autonomous micro aerial vehicles (MAVs) have the advantages of low cost and superior mobility, making them ideal robotic platforms for a wide range of applications such as aerial photography, surveillance, and search and rescue. These aerial robots are able to, in principle, navigate quickly through 3-D unstructured environments and provide fast response in hazardous environments that are dangerous or inaccessible for humans. The ability to achieve high-speed and aggressive flight is essential in such time-critical missions. However, there are still significant research and engineering challenges due to the lack of GPS measurements and potential sensor failures during fast motions. In particular, reliable state estimation using lightweight, off-the-shelf sensors is still the most important component for aggressive autonomous flight. Inspired by our earlier results towards vision-based high-speed quadrotor flight [1], in this work, we take another step forward using advanced methods in dense visual tracking and optimization-based sensor fusion to achieve robust state estimation. We show that, with only off-the-shelf cameras and IMUs as sensing modalities, we are able to achieve autonomous flight with linear velocities up to 4.2 m/s, linear accelerations up to 9.6 m/s², and angular velocities up to 245.1 degree/s.

As demonstrated in a wide body of literature, cameras are the ideal sensor for tracking of slow to moderate motions using feature-based methods. However, motion blur caused by aggressive motions can seriously downgrade feature detection and tracking performance. Recent advances in direct dense tracking have shown good adaptability to motion blur or textureless environments [2]–[4]. These methods directly operate on image intensities and make full use of all the available information within an image. Bypassing the feature processing pipeline eliminates some of the issues in feature-based methods. However, even dense methods are subject to failure during aggressive motions, as images can be severely blurred. IMUs generate noisy but outlier-free measurements, making them great for tracking of short-term fast motions. However, low-cost MEMS IMUs suffer significant drift in the long term. We believe that combining the complementary natures of dense visual tracking and IMU measurements opens up the possibility of reliable tracking of aggressive motions.

All authors are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. {ylingaa, tliuam}@connect.ust.hk, [email protected]

This work was supported by HKUST project R9341. The authors would also like to thank the research donation and equipment support from DJI.

Fig. 1. Our experimental platform is the DJI M100 quadrotor equipped with an onboard computer (Intel i5-4250U 1.3 GHz dual-core CPU) and a VI-Sensor (forward-facing stereo cameras and a MEMS IMU). The platform weighs 3210 g in total.

Fig. 2. Snapshots during an aggressive autonomous flight experiment. We highlight the position of the robot with red circles. The maximum linear velocity, linear acceleration and angular velocity are 4.2 m/s, 9.6 m/s² and 245.1 degree/s respectively.

The key contribution of this work is a robust and fully integrated real-time solution for aggressive quadrotor flight. Our method uses the information from a pair of calibrated stereo cameras and a MEMS IMU and runs onboard a computer with moderate computing power (Fig. 1). The focus of this work is an optimization-based probabilistic estimation method that fuses pre-integrated IMU measurements and multi-constrained relative pose measurements from a depth uncertainty-aware dense visual tracking module. Our estimator actively searches for multi-constrained dense alignments between frames within a sliding window. This loop-closure-like method enables the estimator to recover from complete loss of visual tracking and to eliminate drift after very aggressive motions. In addition, we initialize the incremental rotation for dense tracking using the angular prior from IMU measurements, which greatly improves the convergence property during aggressive motions. We release our code as open-source ROS packages with relevant video demonstrations. They are available at: https://github.com/ygling2008/dense_new.

The estimator of our proposed system is an extension of our preliminary work [4], with improvements in uncertainty-aware dense tracking and robust graph-based optimization. We believe that we are the first to introduce a practical visual-inertial system for aggressive autonomous flight of quadrotors.

The rest of the paper is structured as follows. In Sect. II, we review the state-of-the-art scholarly work. An overview of the system is presented in Sect. III. Details of the system are discussed in Sect. IV. Implementation details and experimental evaluations are presented in Sect. V. Sect. VI draws the conclusions and points out possible future extensions.

II. RELATED WORK

There has been extensive scholarly work in visual-inertial state estimation. Visual measurements can be calculated from different camera configurations, such as monocular [1, 5]–[8], stereo [9], or RGB-D cameras [10]. The majority of these approaches rely on detecting and tracking sparse features across multiple frames. Though feature-based methods are well-developed, they depend heavily on image quality and are subject to failure when cameras undergo aggressive motions that lead to severe motion blur. With recent advances in high-performance mobile computing, direct dense methods have become popular [2, 3, 11]–[13]. By directly minimizing the photometric intensity error between images, these methods eliminate the feature processing pipeline, making them more resistant to image blur, provided that the two images are about equally blurred.

It is straightforward to apply some variation of Kalman filtering [1, 7, 10] to loosely fuse visual and inertial measurements. The high-level effect of such fusion is the smoothing of vision-based tracking, and the use of IMUs for short-term motion prediction when visual tracking fails. In loosely-coupled methods, visual measurements are usually presented in the form of relative pose transformations, while leaving the visual pose tracking as a black box. This leads to lower computational complexity at the cost of suboptimal results. Recent developments in visual-inertial fusion indicate that tightly-coupled methods outperform their loosely-coupled counterparts in terms of estimation accuracy [5, 6, 8, 9, 14]. However, this comes at the cost of higher computational complexity.

While dense methods are well-established in vision-only settings using RGB-D [11, 13], stereo [15] and monocular [3, 12] cameras, few visual-inertial fusion approaches incorporate dense tracking. Most similar to our work is [16], where dense tracking results are loosely fused with inertial measurements using an extended Kalman filter. No multi-constrained measurements or graph-based optimization are used in [16], thus limiting success in tracking failure recovery. We show through online experiments that our approach outperforms that of [16] during aggressive motions.

III. SYSTEM OVERVIEW

The pipeline of our proposed system is illustrated in Fig. 3. Six threads run simultaneously, utilizing the multi-core architecture.

The first thread is the driver thread, which performs basic operations such as data acquisition and image rectification.

The dense tracking thread performs keyframe-to-frame direct dense tracking while taking the pixel disparity noise into account. The angular prior from integration of gyroscope measurements initializes the incremental rotation. This thread also identifies instantaneous tracking performance, detects tracking failure and determines whether to add a new keyframe. A disparity map is computed using the standard block matching algorithm if a new keyframe is added. The visual measurements and their corresponding frames are stored in a frame list buffer for further processing by the optimization thread (Sect. IV-C).

The optimization thread maintains a sliding window of states and measurements, and checks the frame list buffer periodically. If the buffer is not empty, all the frames within it are added to the sliding window. If a keyframe is added, loop-closure detection is performed to find possible visual connections between keyframes. Graph optimization is then applied to find the maximum a posteriori estimate of all the states within the sliding window using measurements from IMU pre-integration (Sect. IV-B), multi-constrained relative pose measurements (Sect. IV-D) and the prior. A two-way marginalization scheme that selectively removes states is used in order to bound both the computational complexity and the IMU integration time, and to maximize the information stored within the sliding window (Sect. IV-F).

The latest state information after graph-based optimization is sent to the unscented Kalman filter (UKF)-based smoothing thread, which generates high-rate state estimation for quadrotor control (Sect. IV-H).

The trajectory generation and control thread (Sect. IV-I) generates a smooth trajectory once the user specifies waypoints, maximum velocities and maximum accelerations. After that, this thread gives commands to the controller.

The controller thread receives commands from the trajectory generation and control thread and state information from the UKF-based smoothing thread. It then calculates the thrust and the desired attitude accordingly. Finally, commands are sent by the controller to the quadrotor for execution (Sect. IV-J).

Fig. 3. The pipeline of our proposed system. It consists of six modules running in separate threads to ensure real-time availability of state estimates. The driver thread runs at 200 Hz, the dense tracking thread at 25 Hz, the optimization thread at 25 Hz, the UKF-based smoothing thread at 100 Hz, the trajectory generation and control thread at 100 Hz, and the controller thread at 50 Hz.

IV. OPTIMIZATION-BASED SENSOR FUSION

We consider $(\cdot)^k$ as the camera frame while taking the $k$th image, and $(\cdot)^{b}$ as the instantaneous IMU body frame. Without loss of generality, we assume that the cameras and the IMU are aligned. The camera-IMU sensor suite is rigidly mounted, with intrinsic and extrinsic parameters calibrated beforehand. $\mathbf{p}^X_Y$, $\mathbf{v}^X_Y$ and $\mathbf{R}^X_Y$ are the 3-D position, velocity and rotation of camera frame $Y$ with respect to frame $X$. We also have the corresponding quaternion ($\mathbf{q}^X_Y = [q_x, q_y, q_z, q_w]$) representation for rotation. Hamilton notation is used for quaternions.

Given two time instants that correspond to two images, we can write the IMU propagation model for position, velocity and rotation with respect to the first state of the system as in [17]:
\[
\begin{aligned}
\mathbf{p}^0_{k+1} &= \mathbf{p}^0_k + \mathbf{R}^0_k \mathbf{v}_k \Delta t - \mathbf{g}^0 \Delta t^2/2 + \mathbf{R}^0_k \boldsymbol{\alpha}^k_{k+1} \\
\mathbf{v}_{k+1} &= \mathbf{R}^{k+1}_k \left( \mathbf{v}_k + \boldsymbol{\beta}^k_{k+1} - \mathbf{R}^k_0 \mathbf{g}^0 \Delta t \right) \\
\mathbf{q}^0_{k+1} &= \mathbf{q}^0_k \otimes \mathbf{q}^k_{k+1},
\end{aligned}
\tag{1}
\]
where $\Delta t$ is the interval between two image acquisitions, and $\mathbf{g}^0$ is the gravity vector expressed in the first state of the system. $\boldsymbol{\alpha}^k_{k+1}$ and $\boldsymbol{\beta}^k_{k+1}$ can be obtained by integrating the IMU measurements between time instants $k$ and $k+1$, with the definition detailed in Sect. IV-B.
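As a concrete illustration of (1), the following is a minimal sketch of one propagation step, assuming Eigen for the linear algebra. The struct layout and function names are illustrative and not taken from the released packages; alpha, beta and q_k_kp1 stand for the pre-integrated quantities $\boldsymbol{\alpha}^k_{k+1}$, $\boldsymbol{\beta}^k_{k+1}$ and $\mathbf{q}^k_{k+1}$ defined in Sect. IV-B.

```cpp
// Minimal sketch of the propagation model (1), assuming Eigen.
#include <Eigen/Dense>

struct State {
  Eigen::Vector3d p;     // p^0_k: position with respect to the first state
  Eigen::Vector3d v;     // v_k:   velocity expressed in frame k
  Eigen::Quaterniond q;  // q^0_k: rotation from frame k to frame 0
};

State propagate(const State& x_k,
                const Eigen::Vector3d& alpha,       // alpha^k_{k+1}
                const Eigen::Vector3d& beta,        // beta^k_{k+1}
                const Eigen::Quaterniond& q_k_kp1,  // q^k_{k+1}
                const Eigen::Vector3d& g0,          // gravity in frame 0
                double dt) {
  const Eigen::Matrix3d R_0_k = x_k.q.toRotationMatrix();  // R^0_k
  State x_kp1;
  x_kp1.p = x_k.p + R_0_k * x_k.v * dt - 0.5 * g0 * dt * dt + R_0_k * alpha;
  x_kp1.q = x_k.q * q_k_kp1;                                // q^0_{k+1}
  const Eigen::Matrix3d R_kp1_k =
      q_k_kp1.toRotationMatrix().transpose();               // R^{k+1}_k
  x_kp1.v = R_kp1_k * (x_k.v + beta - R_0_k.transpose() * g0 * dt);
  return x_kp1;
}
```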

A. State Estimation Formulation

We set the position and rotation of the first state to be zero. The initial velocity and gravity vector can be obtained using the online initialization method presented in [18]. The full state vector is defined as
\[
\begin{aligned}
\mathcal{X} &= [\mathbf{x}^0_0, \mathbf{x}^0_1, \ldots, \mathbf{x}^0_N] \\
\mathbf{x}^0_k &= [\mathbf{p}^0_k, \mathbf{v}_k, \mathbf{q}^0_k] \\
\mathbf{p}^0_0 &= [0, 0, 0], \quad \mathbf{q}^0_0 = [0, 0, 0, 1].
\end{aligned}
\tag{2}
\]

We aim to obtain a maximum a posteriori (MAP) estimate by minimizing the sum of the Mahalanobis norm of the weighted visual measurement residuals, inertial measurement residuals and the prior:
\[
\min_{\mathcal{X}} \; \|\mathbf{r}_p - \mathbf{H}_p \mathcal{X}\|^2
+ \sum_{k \in S_i} \|\mathbf{r}_{S_i}(\mathbf{z}^k_{k+1}, \mathcal{X})\|^2_{\mathbf{P}^k_{k+1}}
+ \sum_{(i,j) \in S_c} \|\mathbf{r}_{S_c}(\mathbf{z}^j_i, \mathcal{X})\|^2_{(\mathbf{W}^j_i)^{-1} \mathbf{P}^j_i},
\tag{3}
\]
where $\mathbf{H}_p$ and $\mathbf{r}_p$ are the prior matrix and prior residual vector respectively, and $S_i$ and $S_c$ are the sets of inertial and visual measurements respectively. $\mathbf{r}_{S_i}(\mathbf{z}^k_{k+1}, \mathcal{X})$ is the residual function that measures the residual between the inertial measurements and $\mathcal{X}$ with covariance $\mathbf{P}^k_{k+1}$, and $\mathbf{r}_{S_c}(\mathbf{z}^j_i, \mathcal{X})$ is the residual function that measures the reprojection error between the visual measurements and $\mathcal{X}$ with covariance $\mathbf{P}^j_i$. Since visual measurements are subject to failure, we add a diagonal matrix $\mathbf{W}^j_i$ to weight the influence of $\mathbf{r}_{S_c}(\mathbf{z}^j_i, \mathcal{X})$. Inertial measurements are obtained by IMU pre-integration (Sect. IV-B) and visual measurements are obtained by multi-constrained dense alignments (Sect. IV-D).
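Each term in (3) is a squared Mahalanobis norm, $\|\mathbf{r}\|^2_{\mathbf{P}} = \mathbf{r}^T \mathbf{P}^{-1} \mathbf{r}$; for the visual terms, the weighted covariance $(\mathbf{W}^j_i)^{-1}\mathbf{P}^j_i$ plays the role of $\mathbf{P}$. The following is a minimal sketch of how such a term can be evaluated, assuming Eigen; it is illustrative only, and a real implementation would cache the information matrix or its square root instead of solving on every evaluation.

```cpp
#include <Eigen/Dense>

// Squared Mahalanobis norm ||r||^2_P = r^T P^{-1} r used in (3).
// Sketch only: P x = r is solved with LDLT instead of forming P^{-1}.
double mahalanobis_sq(const Eigen::VectorXd& r, const Eigen::MatrixXd& P) {
  return r.dot(P.ldlt().solve(r));
}
```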

B. IMU Pre-integration

In this work, we adopt our proposed IMU pre-integration method [8], which summarizes IMU measurements in a local frame and gets rid of the need for re-integration when the linearization point changes. The integration of the IMU measurements between time instants $k$ and $k+1$ is
\[
\mathbf{z}^k_{k+1} =
\begin{bmatrix}
\boldsymbol{\alpha}^k_{k+1} \\
\boldsymbol{\beta}^k_{k+1} \\
\mathbf{q}^k_{k+1}
\end{bmatrix}
=
\begin{bmatrix}
\iint_{t \in [k, k+1]} \mathbf{R}^k_t \mathbf{a}^{b_t} \, dt^2 \\
\int_{t \in [k, k+1]} \mathbf{R}^k_t \mathbf{a}^{b_t} \, dt \\
\int_{t \in [k, k+1]} \frac{1}{2}
\begin{bmatrix}
-\lfloor \boldsymbol{\omega}^{b_t} \times \rfloor & \boldsymbol{\omega}^{b_t} \\
-\boldsymbol{\omega}^{b_t\,T} & 0
\end{bmatrix}
\mathbf{q}^k_t \, dt
\end{bmatrix},
\tag{4}
\]
where $\mathbf{a}^{b_t}$ and $\boldsymbol{\omega}^{b_t}$ are the instantaneous linear acceleration and angular velocity in the IMU body frame at time $t$ respectively. The residual function between the states and the IMU measurement is defined as
\[
\mathbf{r}_{S_i}(\mathbf{z}^k_{k+1}, \mathcal{X}) =
\begin{bmatrix}
\delta\boldsymbol{\alpha}^k_{k+1} \\
\delta\boldsymbol{\beta}^k_{k+1} \\
\delta\boldsymbol{\theta}^k_{k+1}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{R}^k_0 \left( \mathbf{p}^0_{k+1} - \mathbf{p}^0_k + \mathbf{g}^0 \frac{\Delta t^2}{2} \right) - \mathbf{v}_k \Delta t - \boldsymbol{\alpha}^k_{k+1} \\
\mathbf{R}^k_0 \left( \mathbf{R}^0_{k+1} \mathbf{v}_{k+1} + \mathbf{g}^0 \Delta t \right) - \mathbf{v}_k - \boldsymbol{\beta}^k_{k+1} \\
2 \left[ (\mathbf{q}^k_{k+1})^{-1} (\mathbf{q}^0_k)^{-1} \mathbf{q}^0_{k+1} \right]_{xyz}
\end{bmatrix}.
\tag{5}
\]
The covariance $\mathbf{P}^k_{k+1}$ can be calculated by iteratively linearizing the continuous-time dynamics of the error term and then updating it with a discrete-time approximation. A detailed derivation can be found in [8].
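For illustration, the continuous-time integrals in (4) can be accumulated from discrete IMU samples with simple Euler integration, as in the following sketch (Eigen assumed; all names are illustrative, and the covariance propagation of $\mathbf{P}^k_{k+1}$ described above is omitted).

```cpp
// Euler-integration sketch of the pre-integrated quantities in (4).
#include <Eigen/Dense>
#include <vector>

struct ImuSample {
  Eigen::Vector3d a;   // a^{b_t}: linear acceleration in the body frame
  Eigen::Vector3d w;   // w^{b_t}: angular velocity in the body frame
  double dt;           // sample interval
};

struct PreIntegration {
  Eigen::Vector3d alpha = Eigen::Vector3d::Zero();          // alpha^k_{k+1}
  Eigen::Vector3d beta  = Eigen::Vector3d::Zero();          // beta^k_{k+1}
  Eigen::Quaterniond q  = Eigen::Quaterniond::Identity();   // q^k_{k+1}
};

PreIntegration preintegrate(const std::vector<ImuSample>& samples) {
  PreIntegration z;
  for (const ImuSample& s : samples) {
    const Eigen::Matrix3d R_k_t = z.q.toRotationMatrix();   // R^k_t
    z.alpha += z.beta * s.dt + 0.5 * R_k_t * s.a * s.dt * s.dt;
    z.beta  += R_k_t * s.a * s.dt;
    // q^k_{t+dt} = q^k_t * dq, with dq ~ (1, 0.5 * w * dt) for small angles.
    Eigen::Quaterniond dq(1.0, 0.5 * s.w.x() * s.dt,
                               0.5 * s.w.y() * s.dt,
                               0.5 * s.w.z() * s.dt);
    z.q = (z.q * dq).normalized();
  }
  return z;
}
```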

C. Uncertainty-aware Dense Tracking

In the uncertainty-aware dense tracking module, we assume that image brightness is the same for the reference frame $i$ and the tracked frame $j$, and calculate the rigid-body transformation, denoted as $\mathbf{T}^j_i = \{\mathbf{t}^j_i, \mathbf{R}^j_i\} \in SE(3)$, where $\mathbf{t}^j_i$ and $\mathbf{R}^j_i$ are translation and rotation respectively, that minimizes the intensity differences and takes the inverse depth uncertainty into account:
\[
\mathbf{T}^{j*}_i = \arg\min_{\mathbf{T}^j_i} \iint \frac{1}{\sigma^2_{\mathbf{u}}} \, \delta I(\mathbf{T}^j_i, \mathbf{u}, \rho_{\mathbf{u}})^2 \, d\mathbf{u}
\tag{6}
\]
\[
\delta I(\mathbf{T}^j_i, \mathbf{u}, \rho_{\mathbf{u}}) = I_i[\mathbf{u}] - I_j[\pi(\mathbf{R}^j_i \pi^{-1}(\mathbf{u}, \rho_{\mathbf{u}}) + \mathbf{t}^j_i)],
\tag{7}
\]
where $\mathbf{u}$ is the coordinate of a pixel, $\rho_{\mathbf{u}}$ is the inverse depth of pixel $\mathbf{u}$, $\sigma^2_{\mathbf{u}}$ is the covariance of $\delta I(\mathbf{T}^j_i, \mathbf{u}, \rho_{\mathbf{u}})$, $I_k[\mathbf{u}]$ is the intensity value of pixel $\mathbf{u}$ in image $k$ with covariance $\sigma^2_{I_k}$, $\pi(\cdot): \mathbb{R}^3 \rightarrow \mathbb{R}^2$ is the projection function that projects a 3-D point $\mathbf{f} = [x, y, z]^T$ into the image coordinate $\mathbf{u} = (u, v)$, and $\pi^{-1}(\cdot)$ is the inverse projection function.

Since the inverse depth map is calculated using the block matching algorithm, i.e., $\rho_{\mathbf{u}} = s\lambda_{\mathbf{u}}$, where $\lambda_{\mathbf{u}}$ is the disparity of pixel $\mathbf{u}$ and $s$ is a scaling constant related to the length of the stereo baseline and the focal length of the camera, we model the noise of $\lambda_{\mathbf{u}}$ as zero-mean white Gaussian with covariance $\sigma^2_{\lambda_{\mathbf{u}}}$. The covariance of $\rho_{\mathbf{u}}$ is
\[
\sigma^2_{\rho_{\mathbf{u}}} = s^2 \sigma^2_{\lambda_{\mathbf{u}}}.
\tag{8}
\]
We assume that the error sources are independent and approximate $\sigma^2_{\mathbf{u}}$ with first-order covariance propagation:
\[
\sigma^2_{\mathbf{u}} = \mathbf{J}_{\mathbf{u}}^T
\begin{bmatrix}
\sigma^2_{I_i} & 0 & 0 & 0 \\
0 & \sigma^2_{I_j} & 0 & 0 \\
0 & 0 & \sigma^2_{\mathbf{T}^j_i} & 0 \\
0 & 0 & 0 & \sigma^2_{\rho_{\mathbf{u}}}
\end{bmatrix}
\mathbf{J}_{\mathbf{u}},
\tag{9}
\]
where $\mathbf{J}_{\mathbf{u}}$ is the Jacobian of $\delta I(\mathbf{T}^j_i, \mathbf{u}, \rho_{\mathbf{u}})$ with respect to $I_i$, $I_j$, $\mathbf{T}^j_i$ and $\rho_{\mathbf{u}}$.

We adopt the Gauss-Newton approach on the Lie manifold to solve (6), which iteratively re-linearizes (6) around the current estimate $\mathbf{T}^j_i$ and then performs an incremental update until convergence:
\[
\mathbf{T}^j_i \leftarrow \mathbf{T}^j_i \otimes \exp(\xi),
\tag{10}
\]
where $\xi = (\delta\mathbf{t}^j_i, \delta\boldsymbol{\theta}^j_i) \in \mathfrak{se}(3)$ is the minimum-dimension error state. More details about the Lie algebra $\mathfrak{se}(3)$ and the Lie group $SE(3)$ can be found in [19].

Augmenting (6) with the linearization leads to the following linear system:
\[
\mathbf{J}^T \mathbf{W} \mathbf{J} \, \xi = \mathbf{J}^T \mathbf{W} \mathbf{r},
\tag{11}
\]
where $\mathbf{J}$ is a Jacobian matrix formed by stacking the Jacobians of the image intensity differences (6) with respect to $\xi$, $\mathbf{r}$ is the corresponding vector of intensity differences, and $\mathbf{W}$ is a diagonal matrix that encodes the uncertainty.

In the real implementation, we adopt image pyramids to increase the convergence region and handle large movements. Only pixels with noticeable gradients are used, for efficiency reasons. Moreover, the angular prior from integration of the IMU instantaneous angular velocity is used for initializing the incremental rotation.
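A minimal sketch of one iteration of (10)-(11) is given below, assuming Eigen and that the stacked Jacobian, residual vector and per-pixel weights $1/\sigma^2_{\mathbf{u}}$ from (9) have already been computed for the current pyramid level. For brevity, the $SE(3)$ exponential is approximated by an $SO(3)$ exponential plus a plain translation update; the names are illustrative and not taken from the released code.

```cpp
// Sketch of one Gauss-Newton step for (11), J^T W J xi = J^T W r, followed
// by the right-multiplicative update (10).
#include <Eigen/Dense>

struct Pose {                          // T^j_i = {t, R}
  Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
  Eigen::Vector3d t = Eigen::Vector3d::Zero();
};

Pose gauss_newton_step(const Pose& T, const Eigen::MatrixXd& J,   // N x 6
                       const Eigen::VectorXd& r,                  // residuals
                       const Eigen::VectorXd& w) {                // 1/sigma^2_u
  const Eigen::Matrix<double, 6, 6> H =
      J.transpose() * w.asDiagonal() * J;                  // J^T W J
  const Eigen::Matrix<double, 6, 1> b =
      J.transpose() * w.asDiagonal() * r;                  // J^T W r
  const Eigen::Matrix<double, 6, 1> xi = H.ldlt().solve(b);
  // Note: the measurement covariance P^j_i of Sect. IV-C is H^{-1}
  // at the final iteration.
  const Eigen::Vector3d dt  = xi.head<3>();                // delta t^j_i
  const Eigen::Vector3d dth = xi.tail<3>();                // delta theta^j_i
  Eigen::Matrix3d dR = Eigen::Matrix3d::Identity();
  if (dth.norm() > 1e-12)
    dR = Eigen::AngleAxisd(dth.norm(), dth.normalized()).toRotationMatrix();
  Pose out;
  out.R = T.R * dR;                    // T^j_i <- T^j_i (x) exp(xi)
  out.t = T.t + T.R * dt;
  return out;
}
```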

The visual measurement $\mathbf{z}^j_i$ in (3) is $\mathbf{z}^j_i = \mathbf{T}^{j*}_i$, and the residual function is defined as
\[
\mathbf{r}_{S_c}(\mathbf{z}^j_i, \mathcal{X}) =
\begin{bmatrix}
\delta\mathbf{t}^j_i \\
\delta\boldsymbol{\theta}^j_i
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{R}^j_0 (\mathbf{p}^0_i - \mathbf{p}^0_j) - \mathbf{t}^j_i \\
2 \left[ (\mathbf{q}^j_i)^{-1} (\mathbf{q}^0_j)^{-1} \mathbf{q}^0_i \right]_{xyz}
\end{bmatrix},
\tag{12}
\]
where $\mathbf{q}^j_i$ is the quaternion representation of $\mathbf{R}^j_i$. It can be derived mathematically that the corresponding covariance $\mathbf{P}^j_i$ is the inverse of the Hessian matrix $\mathbf{J}^T \mathbf{W} \mathbf{J}$ at the final iteration.

D. Multi-constrained Dense Alignments

Uncertainty-aware dense tracking is performed between the latest keyframe within the sliding window and the incoming frame once a new frame arrives. If the tracking is successful, a visual measurement and its corresponding covariance are inserted into the sliding window.

In addition, since significant drift may occur after aggressive motions, we introduce a local loop-closure module for recovery. Once a new keyframe is added, loop-closure detection is performed to seek possible visual measurements between existing keyframes within the sliding window and the new keyframe using uncertainty-aware dense tracking. Note that a cross check is adopted to avoid incorrect loop closures: the check passes only if the two corresponding estimated rigid-body transformations are consistent, as sketched below.
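A minimal sketch of such a cross check, assuming Eigen: dense tracking is run in both directions, and the composition of the two estimates must stay close to the identity. The thresholds are illustrative, not the values used in the paper.

```cpp
// Cross check for loop-closure candidates: T_ij * T_ji should be ~identity.
#include <Eigen/Dense>
#include <Eigen/Geometry>

bool cross_check(const Eigen::Isometry3d& T_ij, const Eigen::Isometry3d& T_ji,
                 double trans_tol = 0.05 /* m */,
                 double rot_tol = 0.05 /* rad */) {
  const Eigen::Isometry3d E = T_ij * T_ji;          // composed error
  const Eigen::Matrix3d R_err = E.rotation();
  const double trans_err = E.translation().norm();
  const double rot_err = Eigen::AngleAxisd(R_err).angle();
  return trans_err < trans_tol && rot_err < rot_tol;
}
```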

E. Dense Tracking Failure Detection

Although our proposed uncertainty-aware dense tracking takes the inverse depth uncertainty into account and increases the robustness compared to traditional methods, it still fails in extreme cases, such as during aggressive motions within textureless surroundings. Detection of dense tracking failure is therefore of vital importance to our system. As shown in Sect. IV-C, the covariance matrix $(\mathbf{J}^T \mathbf{W} \mathbf{J})^{-1}$ reflects the dense tracking performance. We consider the dense tracking to be a failure if this covariance is greater than a certain threshold. Also, since IMU measurements are reliable in the short term, dense tracking is considered to have failed if its incremental transformation estimate is not consistent with the prior from the IMU integration. A sketch of both checks follows.
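The two tests can be combined as in the following sketch (Eigen assumed; the covariance norm used here and the thresholds are illustrative choices, not the exact criteria of the released code).

```cpp
// Dense tracking failure detection: reject if the tracking covariance is too
// large, or if the incremental motion disagrees with the IMU prediction.
#include <Eigen/Dense>
#include <Eigen/Geometry>

bool tracking_failed(const Eigen::Matrix<double, 6, 6>& covariance,
                     const Eigen::Isometry3d& T_dense,  // from dense tracking
                     const Eigen::Isometry3d& T_imu,    // from IMU integration
                     double cov_tol, double trans_tol, double rot_tol) {
  if (covariance.norm() > cov_tol) return true;         // too uncertain
  const Eigen::Isometry3d E = T_imu.inverse() * T_dense;
  const Eigen::Matrix3d R_err = E.rotation();
  if (E.translation().norm() > trans_tol) return true;
  if (Eigen::AngleAxisd(R_err).angle() > rot_tol) return true;
  return false;
}
```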


F. Two-way Marginalization

A two-way marginalization scheme [18] is used to maintain a sliding window of states and to convert the measurements corresponding to the marginalized states into a prior. Since the memory and computational resources of our system are limited, we can only optimize a certain number of states and visual or inertial measurements in a real-time setting. The effectiveness of multi-constrained dense alignment and drift elimination depends on whether an older state is kept within the sliding window. We additionally need to ensure that the time interval for each IMU pre-integration is bounded in order to bound the accumulated error. With two-way marginalization, all information of the removed states is kept and the computational complexity is bounded, which is fundamentally different from traditional keyframe-based approaches that simply drop non-keyframes. Front marginalization removes the second newest frame, while back marginalization removes the oldest keyframe within the sliding window. In this work, to preserve all the information (visual and inertial measurements) related to non-keyframes, we only perform front marginalization after the newest state arrives.

Note that the criterion for selecting front or back marginalization is based on the result from the dense tracking failure detection module introduced in Sect. IV-E. The second newest state will be marginalized in the next round if the dense tracking is good and the second newest state is near the current keyframe; otherwise, the oldest state will be marginalized. The distance is thresholded by a weighted combination of the translation and rotation between the latest keyframe and the second newest frame, as sketched below.
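A minimal sketch of this decision rule, with an illustrative distance metric and thresholds (not the values used onboard):

```cpp
// Front/back marginalization choice of Sect. IV-F.
#include <Eigen/Dense>
#include <Eigen/Geometry>

enum class Marginalize { kFront, kBack };

Marginalize choose_marginalization(bool tracking_good,
                                   const Eigen::Isometry3d& T_kf_to_second,
                                   double w_rot = 1.0,
                                   double distance_threshold = 0.5) {
  const double trans = T_kf_to_second.translation().norm();
  const Eigen::Matrix3d R = T_kf_to_second.rotation();
  const double rot = Eigen::AngleAxisd(R).angle();
  const double distance = trans + w_rot * rot;   // weighted combination
  if (tracking_good && distance < distance_threshold)
    return Marginalize::kFront;   // drop the second newest frame
  return Marginalize::kBack;      // drop the oldest keyframe
}
```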

G. Optimization with Robust Norm

Based on the residual functions defined in (5) and (12), we operate on the error state and optimize (3) using the Gauss-Newton method, which iteratively minimizes
\[
\min_{\delta\mathcal{X}} \; \|\mathbf{r}_p - \mathbf{H}_p \mathcal{X}\|^2
+ \sum_{k \in S_i} \|\mathbf{r}_{S_i}(\mathbf{z}^k_{k+1}, \mathcal{X}) + \mathbf{H}^k_{k+1} \delta\mathcal{X}\|^2_{\mathbf{P}^k_{k+1}}
+ \sum_{(i,j) \in S_c} \|\mathbf{r}_{S_c}(\mathbf{z}^j_i, \mathcal{X}) + \mathbf{H}^j_i \delta\mathcal{X}\|^2_{(\mathbf{W}^j_i)^{-1} \mathbf{P}^j_i}
\tag{13}
\]
and then updates
\[
\mathcal{X} = \mathcal{X} \oplus \delta\mathcal{X}
\tag{14}
\]
until convergence. $\mathbf{H}^k_{k+1}$ and $\mathbf{H}^j_i$ are the Jacobian matrices of the inertial measurements and the visual measurements with respect to the states.

To increase the robustness of our proposed system, $\mathbf{W}^j_i$ changes in each iteration to further eliminate possible outliers in dense tracking that pass the failure detection (Sect. IV-E). $\mathbf{W}^j_i$ is computed according to Huber thresholding on the current estimate:
\[
(\mathbf{W}^j_i)_{ul} =
\begin{cases}
\mathbf{I}_{3\times3}, & \text{if } \|\mathbf{R}^j_0(\mathbf{p}^0_j - \mathbf{p}^0_i) - \mathbf{t}^j_i\| \leq c_t \\[4pt]
\dfrac{c_t}{\|\mathbf{R}^j_0(\mathbf{p}^0_j - \mathbf{p}^0_i) - \mathbf{t}^j_i\|} \, \mathbf{I}_{3\times3}, & \text{otherwise}
\end{cases}
\tag{15}
\]
\[
(\mathbf{W}^j_i)_{lr} =
\begin{cases}
\mathbf{I}_{3\times3}, & \text{if } \|2[(\mathbf{q}^j_i)^{-1}(\mathbf{q}^0_j)^{-1}\mathbf{q}^0_i]_{xyz}\| \leq c_a \\[4pt]
\dfrac{c_a}{\|2[(\mathbf{q}^j_i)^{-1}(\mathbf{q}^0_j)^{-1}\mathbf{q}^0_i]_{xyz}\|} \, \mathbf{I}_{3\times3}, & \text{otherwise}
\end{cases}
\tag{16}
\]
where $(\mathbf{W}^j_i)_{ul}$ is the upper-left $3\times3$ block of $\mathbf{W}^j_i$, $(\mathbf{W}^j_i)_{lr}$ is the lower-right $3\times3$ block of $\mathbf{W}^j_i$, $\mathbf{I}_{3\times3}$ is an identity matrix, and $c_t$ and $c_a$ are the given translation and angular thresholds respectively.
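A minimal sketch of (15)-(16), assuming Eigen; the two 3-D residual vectors are the translational and rotational residuals appearing in (15)-(16), evaluated at the current estimate, and $c_t$, $c_a$ are the translation and angular thresholds.

```cpp
// Huber-style weighting matrix W^j_i of (15)-(16): identity below the
// threshold, scaled down proportionally to the residual magnitude otherwise.
#include <Eigen/Dense>

Eigen::Matrix<double, 6, 6> huber_weight(const Eigen::Vector3d& r_trans,
                                         const Eigen::Vector3d& r_rot,
                                         double c_t, double c_a) {
  Eigen::Matrix<double, 6, 6> W = Eigen::Matrix<double, 6, 6>::Identity();
  const double nt = r_trans.norm();
  if (nt > c_t) W.topLeftCorner<3, 3>() *= c_t / nt;       // (W^j_i)_ul
  const double na = r_rot.norm();
  if (na > c_a) W.bottomRightCorner<3, 3>() *= c_a / na;   // (W^j_i)_lr
  return W;
}
```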

H. UKF-based Smoothing

The 25 Hz state estimation from the optimization module alone is not sufficient to control the quadrotor. The flight controller, with a separate IMU, streams out measurements at 100 Hz. As a result, we employ a UKF to smooth poses and velocities at 100 Hz [20]. The inputs to the UKF-based smoothing thread are the absolute poses and velocities with respect to the zero frame from the optimization module and the IMU measurements from the flight controller. The states in the UKF are $\mathbf{x} = [\mathbf{r}, \dot{\mathbf{r}}, \boldsymbol{\Psi}, \mathbf{b}_a]^T$, where $\boldsymbol{\Psi} = [\phi, \theta, \psi]^T$ represents the roll, pitch, and yaw angles of the quadrotor respectively, $\mathbf{b}_a = [b_{a_x}, b_{a_y}, b_{a_z}]^T$ is the accelerometer bias expressed in the body frame, and $\mathbf{r}$ and $\dot{\mathbf{r}}$ are the position and velocity of the quadrotor respectively.

I. Trajectory Generation

Given a set of expected waypoints and the maximum velocity and acceleration, we would like to generate a smooth trajectory so that the quadrotor can follow it while moving as fast as possible. We utilize a polynomial trajectory generation algorithm [21] that runs onboard.

J. Control

Fig. 4. Control pipeline.

The controller aims to control the quadrotor to follow the generated trajectory. Suppose the differences between the current and expected (denoted with $*$) position and velocity are $\mathbf{e}_p = \mathbf{r}_t - \mathbf{r}^*_t$ and $\mathbf{e}_v = \dot{\mathbf{r}}_t - \dot{\mathbf{r}}^*_t$. The force vector for the controller is
\[
\mathbf{F} = -\mathbf{K}_p \mathbf{e}_p - \mathbf{K}_v \mathbf{e}_v + m g \mathbf{z}_W + m \ddot{\mathbf{r}}^*,
\tag{17}
\]
where $\mathbf{K}_p$ and $\mathbf{K}_v$ are constant gains, and $\mathbf{z}_W$ is the vertical axis in the world frame [22]. The state information from the UKF is used as the feedback of the controller. Fig. 4 shows the pipeline of the control actions. The outputs from the optimization thread are valid odometry (tracking is good), keyframe odometry (a new keyframe is inserted) and invalid odometry (tracking is bad, and only the integration of inertial measurements is available). Normally, trajectory control takes charge. However, if invalid odometry lasts more than two seconds, urgent control, which sets the roll and pitch angles to zero and holds the yaw angle and the height fixed, is enabled for safety reasons.
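A minimal sketch of (17), assuming Eigen; the gains, mass and desired acceleration are placeholders, and the mapping from $\mathbf{F}$ to thrust and desired attitude (following [22]) is omitted.

```cpp
// Desired force vector of (17): F = -Kp*ep - Kv*ev + m*g*zW + m*a_des.
#include <Eigen/Dense>

Eigen::Vector3d desired_force(const Eigen::Vector3d& p, const Eigen::Vector3d& p_des,
                              const Eigen::Vector3d& v, const Eigen::Vector3d& v_des,
                              const Eigen::Vector3d& a_des,   // \ddot{r}^*
                              const Eigen::Matrix3d& Kp, const Eigen::Matrix3d& Kv,
                              double m, double g = 9.81) {
  const Eigen::Vector3d e_p = p - p_des;           // position error
  const Eigen::Vector3d e_v = v - v_des;           // velocity error
  const Eigen::Vector3d z_w(0.0, 0.0, 1.0);        // world vertical axis z_W
  return -Kp * e_p - Kv * e_v + m * g * z_w + m * a_des;
}
```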

V. EXPERIMENTS

The experimental platform is the Matrice 100 quadrotor from DJI¹ with an onboard SDK for execution of control commands (Fig. 1). The visual-inertial sensor suite mounted on the platform is the VI-Sensor², which consists of a MEMS IMU and two global-shutter cameras in a fronto-parallel stereo configuration. A high-level computer, an Intel NUC with a low-power dual-core i5-4250U CPU running at 1.3 GHz and 16 GB of RAM, is used for the computation in our proposed system. The operating system installed on the Intel NUC is Ubuntu 14.04. The mass of our platform is 3210 g. All the algorithms are developed in C++, with ROS as the interfacing robotics middleware. The frequencies of the IMU data and the stereo camera data are 200 Hz and 25 Hz respectively. The cameras have factory pre-calibrated intrinsics and extrinsics.

Component               | Average Computation Time | Thread
Driver                  | 1 ms                     | 0
Dense Tracking          | 13 ms                    | 1
Block Matching          | 8 ms                     | 1
Graph Optimization      | 6 ms                     | 2
Two-way Marginalization | 3 ms                     | 2
Local Loop-closure      | 30 ms                    | 2
UKF-based Smoothing     | 1 ms                     | 3
Controller              | 1 ms                     | 4
Trajectory Generation   | 3 ms                     | 5
Trajectory Control      | 1 ms                     | 5

TABLE I. Average computation time of the main time-consuming components of our proposed system.

A. Real-time Implementation

To achieve real-time performance, we set the finest resolution for uncertainty-aware dense tracking to be 320×240 and the number of pyramid levels to be 3. Our implementation of dense tracking is built on top of the open-source and vectorized implementation of [3]. The noticeable-gradient threshold of the uncertainty-aware dense tracking is 5 and the sliding window size is 30. The block matching algorithm we use is the simplest and fastest one in OpenCV (StereoBM). We empirically find that, though the disparity map is not great using this native block matching algorithm, the performance of our tracking module is quite good thanks to the uncertainty-aware treatment. The computing time of each component is summarized in Table I.

¹https://developer.dji.com/matrice-100/
²http://www.skybotix.com/

B. Performance with Normal Speed Flights

In this experiment, we control the quadrotor to fly at a speed of about 1 m/s and compare the estimation accuracy of our system with a keypoint-based method (fovis [10]) and semi-dense tracking [12]. Since the method in [12] is for monocular vision, we modify it for a fair comparison by initializing the depth values using stereo block matching when a new keyframe is inserted, instead of generating random values or using continuous depth fusion. The total flying distance is about 80 meters, as shown in Fig. 5. We find that there is occasional tracking loss with fovis, and therefore its final drift is large (thus we do not plot it). The final position drifts of semi-dense tracking and our system are 0.45 m and 0.23 m respectively. Since the flying speed is not fast, both semi-dense tracking and our approach work well. Moreover, our proposed system fully exploits the advantages of multi-constrained relative pose measurements and nonlinear optimization, resulting in less drift.

Fig. 5. We compare the estimation accuracy of our proposed system with semi-dense tracking [12] for quadrotor flights at normal speed. (Plot of the estimated trajectories in the X–Y plane, in meters; legend: stereo semi-dense, proposed.)

C. Performance with Aggressive Flights

We demonstrate the comprehensive performance of the proposed system under aggressive motions by generating expected trajectories (a figure-eight and a cross pattern) and having the quadrotor follow them using feedback control from onboard state estimates. We adopt the figure-eight and cross pattern because the quadrotor undergoes large linear and angular changes, which lead to aggressive motions, when following these two trajectories. The robustness and smoothness of onboard motion estimation are of vital importance for this experiment. Snapshots of these flights are shown in Fig. 2. We compare the real-time estimation accuracy and robustness of our system against state-of-the-art methods, namely the keypoint-based method fovis [10], semi-dense tracking [12] (adapted to stereo in the same way as in the previous experiment) and the loosely-coupled dense VINS dense-EKF [16], and show that our system is the only one that is able to handle these challenging cases. Ground truth data obtained by the OptiTrack Prime 41 system is provided for comparison.

The trajectories of the ground truth and those from the different approaches are shown in Fig. 6(a) and Fig. 7(a). The statistics of angular rates, linear velocities and linear accelerations are shown in Fig. 6(b) and Fig. 7(b). The angular rates are measured by the IMU; the linear velocities and accelerations are calculated by our proposed system. It can easily be seen from the trajectories that both the methods in [10] and [12] perform poorly during aggressive flight. We give a detailed visual comparison of angles and positions between our system, dense-EKF and the ground truth in Fig. 6(c) and (d) and Fig. 7(c) and (d). While the performance of the method in [16] is good on the cross pattern, it is poor on the figure-eight. Our estimator is the best in terms of both accuracy and robustness for both trajectories. The final statistics are summarized in Table II. Our method is the only approach with observability of roll/pitch and multi-constrained visual measurements, resulting in the least drift. Also, the optimization-based fusion scheme of our system provides accurate and smooth trajectories that benefit the controller.³

Trajectory                         | figure-eight | cross pattern
Maximum Angular Rate (degree/s)    | 183.9        | 245.1
Maximum Linear Velocity (m/s)      | 2.8          | 4.2
Maximum Linear Acceleration (m/s²) | 6.7          | 9.6
Position-x Drift (m)               | 0.09         | 0.07
Position-y Drift (m)               | 0.03         | 0.03
Position-z Drift (m)               | 0.24         | 0.09
Yaw Drift (degree)                 | 3.07         | 0.37

TABLE II. Summary of our system compared to ground truth data.

Note that there are limits on the maximum attitude with the SDK of the M100. We will test our approach with more aggressive motions in the future if the platform hardware improves.

VI. CONCLUSIONS AND FUTURE WORK

We proposed a real-time, robust and fully integrated system for quadrotors to fly with aggressive motions. The core of our system is an optimization-based probabilistic sensor fusion scheme that provides robust state estimation. Online experiments verified the superior performance of our system. In the future, we will integrate obstacle avoidance modules into our system.

³Since the ground truth data is not good due to limited space and hardware, we do not include the standard deviation of linear velocities and roll/pitch in Table II.

Fig. 6. Performance of our proposed method compared with ground truth obtained by OptiTrack, fovis [10], semi-dense tracking [12] and dense-EKF [16]. (a) Estimated trajectory of the figure-eight. (b) Statistics of angular rates, linear velocities and linear accelerations during the flight. (c) Comparison of angles. (d) Comparison of positions. Ground truth data is not good due to limited space and hardware.


Fig. 7. Performance of our proposed method compared with ground truth obtained by OptiTrack, fovis [10], semi-dense tracking [12] and dense-EKF [16]. (a) Estimated trajectory of the cross pattern. (b) Statistics of angular rates, linear velocities and linear accelerations during the flight. (c) Comparison of angles. (d) Comparison of positions. Ground truth data is not good due to limited space and hardware.

REFERENCES

[1] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, "Vision-based state estimation and trajectory control towards high-speed flight with a quadrotor," in Proc. of Robot.: Sci. and Syst., Berlin, Germany, 2013.
[2] R. A. Newcombe, S. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in IEEE International Conference on Computer Vision, 2011, pp. 2320–2327.
[3] J. Engel, T. Schops, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision (ECCV), September 2014.
[4] Y. Ling and S. Shen, "Dense visual-inertial odometry for tracking of aggressive motions," in Proc. of the IEEE Intl. Conf. on Robot. and Bio., 2015.
[5] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, "Consistency analysis and improvement of vision-aided inertial navigation," IEEE Trans. Robot., vol. 30, no. 1, pp. 158–176, Feb. 2014.
[6] M. Li and A. Mourikis, "High-precision, consistent EKF-based visual-inertial odometry," Intl. J. Robot. Research, vol. 32, no. 6, pp. 690–711, May 2013.
[7] D. Scaramuzza, M. Achtelik, L. Doitsidis, F. Fraundorfer, E. Kosmatopoulos, A. Martinelli, M. Achtelik, M. Chli, S. Chatzichristofis, L. Kneip, D. Gurdan, L. Heng, G. Lee, S. Lynen, L. Meier, M. Pollefeys, A. Renzaglia, R. Siegwart, J. Stumpf, P. Tanskanen, C. Troiani, and S. Weiss, "Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments," IEEE Robot. Autom. Mag., vol. 21, no. 3, 2014.
[8] S. Shen, N. Michael, and V. Kumar, "Tightly-coupled monocular visual-inertial fusion for autonomous flight of rotorcraft MAVs," in Proc. of the IEEE Intl. Conf. on Robot. and Autom., Seattle, WA, May 2015.
[9] S. Leutenegger, P. Furgale, V. Rabaud, M. Chli, K. Konolige, and R. Siegwart, "Keyframe-based visual-inertial SLAM using nonlinear optimization," in Proc. of Robot.: Sci. and Syst., Berlin, Germany, June 2013.
[10] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Visual odometry and mapping for autonomous flight using an RGB-D camera," in Proc. of the Intl. Sym. of Robot. Research, Flagstaff, AZ, Aug. 2011.
[11] C. Kerl, J. Sturm, and D. Cremers, "Robust odometry estimation for RGB-D cameras," in Proc. of the IEEE Intl. Conf. on Robot. and Autom., May 2013.
[12] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera," in Proc. of the IEEE Intl. Conf. Comput. Vis., Sydney, Australia, December 2013.
[13] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst., 2013.
[14] M. K. G. Huang and J. Leonard, "Towards consistent visual-inertial navigation," in Proc. of the IEEE Intl. Conf. on Robot. and Autom., Hong Kong, June 2014.
[15] A. I. Comport, E. Malis, and P. Rives, "Real-time quadrifocal visual odometry," in Intl. J. Robot. Research, 2010.
[16] S. Omari, M. Bloesch, P. Gohl, and R. Siegwart, "Dense visual-inertial navigation system for mobile robots," in Proc. of the IEEE Intl. Conf. on Robot. and Autom., 2015.
[17] T. Lupton and S. Sukkarieh, "Visual-inertial-aided navigation for high-dynamic motion in built environments without initial conditions," IEEE Trans. Robot., vol. 28, no. 1, pp. 61–76, Feb. 2012.
[18] S. Shen, Y. Mulgaonkar, N. Michael, and V. Kumar, "Initialization-free monocular visual-inertial estimation with application to autonomous MAVs," in Proc. of the Intl. Sym. on Exp. Robot., Morocco, 2014.
[19] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models. Springer Science & Business Media, 2012, vol. 26.
[20] R. Van Der Merwe, E. A. Wan, S. Julier, et al., "Sigma-point Kalman filters for nonlinear estimation and sensor-fusion: Applications to integrated navigation," in Proc. of AIAA Guidance, Navigation and Control Conference, 2004.
[21] C. Richter, A. Bry, and N. Roy, "Polynomial trajectory planning for quadrotor flight," in Workshop on the Intl. Conf. on Field and Service Robot., 2013.
[22] D. Mellinger and V. Kumar, "Minimum snap trajectory generation and control for quadrotors," in Proc. of the IEEE Intl. Conf. on Robot. and Autom., Shanghai, May 2011.