
Monocular SFM

Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving Shiyu Song and Manmohan Chandraker, NEC Laboratories America

Motivation

Results

Results in the KITTI Visual Odometry Dataset

Method

Scale Drift Correction using Ground Plane Estimation

[Figure: the monocular reconstruction gives scaled depths sd_1, sd_2, the scaled camera height sh and camera motion (R, s t).]

Use the height from the ground to resolve the scale s.

We can compute both 3D points and camera motion, but only up to an unknown scale factor.
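Below is a minimal sketch (not the authors' implementation) of how a known camera height resolves the scale: the ratio of the calibrated height to the height recovered from the up-to-scale reconstruction rescales the translation and the 3D points. All names are illustrative.

```python
import numpy as np

def resolve_scale(t_unscaled, points_unscaled, h_est, h_calib):
    """Rescale an up-to-scale monocular reconstruction using the ground plane.

    t_unscaled      : (3,) relative camera translation, known only up to scale
    points_unscaled : (N, 3) triangulated 3D points in the same unknown scale
    h_est           : camera height above the estimated ground plane (up to scale)
    h_calib         : known, calibrated camera height above the road, in meters
    """
    s = h_calib / h_est                        # scale factor that makes the height metric
    return s, s * t_unscaled, s * points_unscaled

# toy usage: the reconstruction puts the ground 0.5 units below a camera mounted 1.7 m high
s, t_metric, X_metric = resolve_scale(np.array([0.0, 0.0, 0.3]),
                                      np.random.rand(10, 3),
                                      h_est=0.5, h_calib=1.7)
print(s)  # 3.4
```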

Approach: Adaptive Cue Fusion

Cue 1: Dense Stereo

$\mathbf{n} = (n_1, n_2, n_3)^T$

[Figure: homography mapping of the ground plane between frame k and frame k+1, with the camera at height h above the ground plane.]

$\min_{\mathbf{n},\,h}\;(1 - \eta_{\mathrm{SAD}})$
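As a concrete illustration, here is a minimal sketch of this cue under stated assumptions (pinhole intrinsics K, relative motion (R, t), grayscale images, and a fixed ground region of interest): a candidate plane (n, h) induces a homography between frame k and frame k+1, and the cost 1 − η_SAD measures how poorly the warped ground region matches. Function and variable names are illustrative, not the authors' code.

```python
import numpy as np
import cv2

def plane_homography(K, R, t, n, h):
    """Homography induced by the ground plane between consecutive frames.

    Convention: ground points X satisfy n^T X = h in the frame-k camera
    coordinates, and frame k+1 is related by X' = R X + t.
    """
    H = K @ (R + np.outer(t, n) / h) @ np.linalg.inv(K)
    return H / H[2, 2]

def cue1_cost(img_k, img_k1, K, R, t, n, h, roi):
    """Cost (1 - eta_SAD) for a candidate plane (n, h) over a ground ROI (x, y, w, h_roi)."""
    x, y, w, h_roi = roi
    H = plane_homography(K, R, t, n, h)
    # render frame k's ground region in frame k+1 coordinates via the plane homography
    warped = cv2.warpPerspective(img_k, H, (img_k1.shape[1], img_k1.shape[0]))
    a = warped[y:y + h_roi, x:x + w].astype(np.float32)
    b = img_k1[y:y + h_roi, x:x + w].astype(np.float32)
    eta_sad = 1.0 - np.abs(a - b).mean() / 255.0   # SAD-based matching score in [0, 1]
    return 1.0 - eta_sad                            # minimized over (n, h)
```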

Cue 2: 3D Points

[Figure: 3D points triangulated between frame k and frame k+1 under relative motion (R, T); side view of the camera above the ground plane with normal n, showing the height h and angle θ in the y–z plane.]

$\max_i \; q_i = \sum_{j \neq i} \exp\!\left(-\mu\, \Delta h_{ij}^{2}\right)$
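A minimal sketch of one reading of this consensus rule: each triangulated road point yields a candidate ground height, and the candidate that agrees most with the others under a Gaussian kernel wins. The kernel width mu and all names are illustrative.

```python
import numpy as np

def points_cue(heights, mu=50.0):
    """Pick the most supported ground height from per-point height candidates.

    heights : (N,) candidate camera-to-ground heights, one per triangulated road point
    mu      : kernel width; larger mu means only very close heights count as agreeing
    Returns the height whose score q_i = sum_{j != i} exp(-mu * (h_i - h_j)^2) is maximal.
    """
    h = np.asarray(heights, dtype=np.float64)
    dh = h[:, None] - h[None, :]                 # pairwise differences Δh_ij
    q = np.exp(-mu * dh**2).sum(axis=1) - 1.0    # subtract the j == i term
    best = int(np.argmax(q))
    return h[best], q[best]

# toy usage: most candidates cluster near 1.7 m, one outlier at 2.5 m
print(points_cue([1.68, 1.71, 1.69, 2.50, 1.70]))
```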

Cue 3: Object

[Figure: a detected object resting on the ground plane (normal $\mathbf{n} = (n_1, n_2, n_3)^T$, camera height h), with bounding-box bottom point $\mathbf{p}_b$ and top point $\mathbf{p}_t$.]

Object height estimation

$\min_{n_3}\;\big(h_b - \bar{h}_b\big)^{2}$
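A minimal geometric sketch of this cue under stated assumptions: if the bounding-box bottom point p_b lies on the ground plane, back-projecting it through the intrinsics K and the current plane (n, h) gives the object's contact point, and the implied 3D height from the top point p_t can be compared to an expected object height (e.g., a per-class prior); the squared residual is then minimized over the component n_3. The helper names and `expected_height` are illustrative, not the authors' formulation.

```python
import numpy as np

def backproject_to_ground(K, p, n, h):
    """Intersect the viewing ray through pixel p = (u, v) with the plane n^T X = h."""
    d = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    return (h / (n @ d)) * d

def implied_object_height(K, p_b, p_t, n, h):
    """3D object height implied by a 2D box whose bottom point p_b rests on the ground.

    The top point p_t is back-projected to the same depth as the ground contact
    point (a simple vertical-object approximation); height is measured along n.
    """
    X_b = backproject_to_ground(K, p_b, n, h)
    d_t = np.linalg.inv(K) @ np.array([p_t[0], p_t[1], 1.0])
    X_t = (X_b[2] / d_t[2]) * d_t
    return float(n @ (X_b - X_t))

def object_cue_residual(n3, K, p_b, p_t, n1, n2, h, expected_height):
    """Squared residual between implied and expected object height, as a function of n_3."""
    n = np.array([n1, n2, n3])
    n = n / np.linalg.norm(n)
    return (implied_object_height(K, p_b, p_t, n, h) - expected_height) ** 2
```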

Highlights

• Highly accurate monocular SFM, with accuracy comparable to stereo.

• Ground plane estimation from multiple cues: dense stereo, 3D points and object detection.

• A novel data-driven framework that adaptively combines the cues based on per-frame observation covariances estimated by rigorously trained models.

• Scale drift is corrected using the optimally fused estimates, yielding high accuracy and robustness.

Comparison of LIDAR, stereo and monocular systems:

              LIDAR      Stereo Camera   Monocular Camera
Cost          $70,000    $1,000          $100
Maintenance   Hard       Hard            Easy

KITTI Visual Odometry Benchmark

Examples of 1D Gaussian fits

Learned linear model relating the observation variance to the underlying variable, the Gaussian variance.

Feature matching and triangulation

(a) Histogram of data. (b) Learned linear model to relate the observation variance to the underlying variable, q.

minπ΄π‘š,ππ‘š,Ξ£π‘š

( π΄π‘šπ‘’βˆ’12(π’ƒπ‘›βˆ’ππ‘š)Ξ£π‘š

βˆ’1(π’ƒπ‘›βˆ’ππ‘š)

𝑀

π‘š=1

βˆ’ 𝛼𝑛)2

𝑁

𝑛=1

Ξ£π‘š =

𝜎π‘₯2 βˆ— βˆ— βˆ—

βˆ— πœŽπ‘¦2 βˆ— βˆ—

βˆ— βˆ— πœŽπ‘€π‘2 βˆ—

βˆ— βˆ— βˆ— πœŽβ„Žπ‘2

,
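A minimal sketch of fitting this sum-of-Gaussians error model by nonlinear least squares, restricted to diagonal covariances for brevity (the starred off-diagonal entries above are not fitted here). The synthetic data and the choice of M are placeholders; the point is only the least-squares form of the objective, not the authors' training pipeline.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_gaussian_sum(B, alpha, M=2):
    """Fit sum_m A_m exp(-0.5 (b - mu_m)^T Sigma_m^{-1} (b - mu_m)) to targets alpha.

    B     : (N, D) observation vectors b_n
    alpha : (N,) target values alpha_n
    Only diagonal Sigma_m are fitted in this sketch.
    """
    N, D = B.shape

    def unpack(theta):
        A = theta[:M]
        mu = theta[M:M + M * D].reshape(M, D)
        log_var = theta[M + M * D:].reshape(M, D)   # log-variances keep Sigma positive
        return A, mu, np.exp(log_var)

    def model(theta):
        A, mu, var = unpack(theta)
        out = np.zeros(N)
        for m in range(M):
            diff = B - mu[m]
            out += A[m] * np.exp(-0.5 * np.sum(diff**2 / var[m], axis=1))
        return out

    theta0 = np.concatenate([np.ones(M),
                             np.tile(B.mean(0), M) + 0.1 * np.random.randn(M * D),
                             np.zeros(M * D)])
    res = least_squares(lambda th: model(th) - alpha, theta0)
    return unpack(res.x)

# toy usage: 4D observations (x, y, w_b, h_b) with synthetic targets
B = np.random.randn(200, 4)
alpha = np.exp(-0.5 * (B**2).sum(axis=1))
A, mu, var = fit_gaussian_sum(B, alpha, M=1)
```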

[Fusion diagram: each cue's per-frame estimates and model-based accuracies a are mapped by a trained error model to observation variances u — Dense Stereo: (h, a_{s,h}), (n_1, a_{s,n_1}), (n_3, a_{s,n_3}) → Trained Model 1 → (h, u_{s,h}), (n_1, u_{s,n_1}), (n_3, u_{s,n_3}); 3D Points: (h, a_{p,h}) → Trained Model 2 → (h, u_{p,h}); Object Detection: (n_3, a_{d,n_3}) → trained model → (n_3, u_{d,n_3}).]

π‘Žπ‘‘π‘˜ =πœŽπ‘¦πœŽβ„Žπ‘πœŽπ‘¦ + πœŽβ„Žπ‘

(a) Histogram of data. (b) Learned linear model to relate the observation variance to the underlying variable, $a_d^k$.
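A minimal sketch of how the per-frame observation variances u can drive the fusion: a scalar Kalman-style update of the ground height in which each cue contributes a measurement weighted by the variance its trained model predicts for that frame. This only illustrates the adaptive-covariance idea, reduced to the height h; the full system fuses the plane orientation as well.

```python
import numpy as np

def kalman_update(x, P, z, u):
    """Scalar Kalman measurement update: state x with variance P, measurement z with variance u."""
    K = P / (P + u)            # Kalman gain: low u (confident cue) -> large correction
    return x + K * (z - x), (1.0 - K) * P

def fuse_ground_height(h_prior, P_prior, cues):
    """Fuse per-frame ground height estimates from several cues.

    cues : list of (h_cue, u_cue) pairs, where u_cue is the observation variance
           predicted by that cue's trained error model for this frame.
    """
    h, P = h_prior, P_prior
    for z, u in cues:
        h, P = kalman_update(h, P, z, u)
    return h, P

# toy usage: the stereo cue is confident this frame, the 3D-points cue is not
print(fuse_ground_height(1.70, 0.05, [(1.65, 0.01), (1.90, 0.5)]))
```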

Top Monocular Performance Summary

Challenge: Scale Ambiguity

[Figure: camera center, image plane, 2D image and ambiguous 3D points — a single image is consistent with 3D scenes at any scale.]

[KITTI plot legend: Stereo Systems, Our Monocular System, Prior Monocular System.]

Fig. Height error relative to ground truth. The effectiveness of our data fusion is shown by the smoother (less spiky) filter output and the far lower error.

Fig. (a) Average error in ground plane estimation. (b) Percentage of frames where the height error is below 7%.

Ground Height

Object Localization

Comparison of 3D object localization errors for calibrated ground, stereo cue only, fixed covariance fusion and adaptive covariance fusion of stereo and detection cues.

Adaptive Fusion
