
Monocular SFM

Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving Shiyu Song and Manmohan Chandraker, NEC Laboratories America

Motivation

Results

Results in the KITTI Visual Odometry Dataset

Method

Scale Drift Correction using Ground Plane Estimation

[Figure: the monocular reconstruction gives scaled depths sd_1, sd_2, the scaled camera height sh and camera motion (R, s t).]

Use the height from the ground to resolve the scale s.

We can compute both 3D points and camera motion, but only up to an unknown scale factor.
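Below is a minimal sketch (not the authors' implementation) of how a known camera height resolves the scale: the ratio of the calibrated height to the height recovered from the up-to-scale reconstruction rescales the translation and the 3D points. All names are illustrative.

```python
import numpy as np

def resolve_scale(t_unscaled, points_unscaled, h_est, h_calib):
    """Rescale an up-to-scale monocular reconstruction using the ground plane.

    t_unscaled      : (3,) relative camera translation, known only up to scale
    points_unscaled : (N, 3) triangulated 3D points in the same unknown scale
    h_est           : camera height above the estimated ground plane (up to scale)
    h_calib         : known, calibrated camera height above the road, in meters
    """
    s = h_calib / h_est                        # scale factor that makes the height metric
    return s, s * t_unscaled, s * points_unscaled

# toy usage: the reconstruction puts the ground 0.5 units below a camera mounted 1.7 m high
s, t_metric, X_metric = resolve_scale(np.array([0.0, 0.0, 0.3]),
                                      np.random.rand(10, 3),
                                      h_est=0.5, h_calib=1.7)
print(s)  # 3.4
```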

Approach: Adaptive Cue Fusion

Cue 1: Dense Stereo

$\mathbf{n} = (n_1, n_2, n_3)^T$

[Figure: homography mapping of the ground plane between frame k and frame k+1, with the camera at height h above the ground plane.]

$\min_{\mathbf{n},\,h}\;(1 - \eta_{\mathrm{SAD}})$
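As a concrete illustration, here is a minimal sketch of this cue under stated assumptions (pinhole intrinsics K, relative motion (R, t), grayscale images, and a fixed ground region of interest): a candidate plane (n, h) induces a homography between frame k and frame k+1, and the cost 1 − η_SAD measures how poorly the warped ground region matches. Function and variable names are illustrative, not the authors' code.

```python
import numpy as np
import cv2

def plane_homography(K, R, t, n, h):
    """Homography induced by the ground plane between consecutive frames.

    Convention: ground points X satisfy n^T X = h in the frame-k camera
    coordinates, and frame k+1 is related by X' = R X + t.
    """
    H = K @ (R + np.outer(t, n) / h) @ np.linalg.inv(K)
    return H / H[2, 2]

def cue1_cost(img_k, img_k1, K, R, t, n, h, roi):
    """Cost (1 - eta_SAD) for a candidate plane (n, h) over a ground ROI (x, y, w, h_roi)."""
    x, y, w, h_roi = roi
    H = plane_homography(K, R, t, n, h)
    # render frame k's ground region in frame k+1 coordinates via the plane homography
    warped = cv2.warpPerspective(img_k, H, (img_k1.shape[1], img_k1.shape[0]))
    a = warped[y:y + h_roi, x:x + w].astype(np.float32)
    b = img_k1[y:y + h_roi, x:x + w].astype(np.float32)
    eta_sad = 1.0 - np.abs(a - b).mean() / 255.0   # SAD-based matching score in [0, 1]
    return 1.0 - eta_sad                            # minimized over (n, h)
```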

Cue 2: 3D Points

[Figure: 3D points triangulated between frame k and frame k+1 under relative motion (R, T); side view of the camera above the ground plane with normal n, showing the height h and angle θ in the y–z plane.]

$\max_i \; q_i = \sum_{j \neq i} \exp\!\left(-\mu\, \Delta h_{ij}^{2}\right)$
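A minimal sketch of one reading of this consensus rule: each triangulated road point yields a candidate ground height, and the candidate that agrees most with the others under a Gaussian kernel wins. The kernel width mu and all names are illustrative.

```python
import numpy as np

def points_cue(heights, mu=50.0):
    """Pick the most supported ground height from per-point height candidates.

    heights : (N,) candidate camera-to-ground heights, one per triangulated road point
    mu      : kernel width; larger mu means only very close heights count as agreeing
    Returns the height whose score q_i = sum_{j != i} exp(-mu * (h_i - h_j)^2) is maximal.
    """
    h = np.asarray(heights, dtype=np.float64)
    dh = h[:, None] - h[None, :]                 # pairwise differences Δh_ij
    q = np.exp(-mu * dh**2).sum(axis=1) - 1.0    # subtract the j == i term
    best = int(np.argmax(q))
    return h[best], q[best]

# toy usage: most candidates cluster near 1.7 m, one outlier at 2.5 m
print(points_cue([1.68, 1.71, 1.69, 2.50, 1.70]))
```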

Cue 3: Object

[Figure: a detected object resting on the ground plane (normal $\mathbf{n} = (n_1, n_2, n_3)^T$, camera height h), with bounding-box bottom point $\mathbf{p}_b$ and top point $\mathbf{p}_t$.]

Object height estimation

$\min_{n_3}\;\big(h_b - \bar{h}_b\big)^{2}$
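A minimal geometric sketch of this cue under stated assumptions: if the bounding-box bottom point p_b lies on the ground plane, back-projecting it through the intrinsics K and the current plane (n, h) gives the object's contact point, and the implied 3D height from the top point p_t can be compared to an expected object height (e.g., a per-class prior); the squared residual is then minimized over the component n_3. The helper names and `expected_height` are illustrative, not the authors' formulation.

```python
import numpy as np

def backproject_to_ground(K, p, n, h):
    """Intersect the viewing ray through pixel p = (u, v) with the plane n^T X = h."""
    d = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    return (h / (n @ d)) * d

def implied_object_height(K, p_b, p_t, n, h):
    """3D object height implied by a 2D box whose bottom point p_b rests on the ground.

    The top point p_t is back-projected to the same depth as the ground contact
    point (a simple vertical-object approximation); height is measured along n.
    """
    X_b = backproject_to_ground(K, p_b, n, h)
    d_t = np.linalg.inv(K) @ np.array([p_t[0], p_t[1], 1.0])
    X_t = (X_b[2] / d_t[2]) * d_t
    return float(n @ (X_b - X_t))

def object_cue_residual(n3, K, p_b, p_t, n1, n2, h, expected_height):
    """Squared residual between implied and expected object height, as a function of n_3."""
    n = np.array([n1, n2, n3])
    n = n / np.linalg.norm(n)
    return (implied_object_height(K, p_b, p_t, n, h) - expected_height) ** 2
```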

Highlights

• Highly accurate monocular SFM, with accuracy comparable to stereo.

• Ground plane estimation from multiple cues: dense stereo, 3D points and object detection.

• A novel data-driven framework that adaptively combines the cues based on per-frame observation covariances estimated by rigorously trained models.

• Scale drift is corrected using the optimally fused estimates, yielding high accuracy and robustness.

Comparison of LIDAR, stereo and monocular systems:

              LIDAR      Stereo Camera   Monocular Camera
Cost          $70,000    $1,000          $100
Maintenance   Hard       Hard            Easy

KITTI Visual Odometry Benchmark

Examples of 1D Gaussian fits

Learned linear model relating the observation variance to the underlying variable, the Gaussian variance.

Feature matching and triangulation

(a) Histogram of data. (b) Learned linear model to relate the observation variance to the underlying variable, q.

minπ΄π‘š,ππ‘š,Ξ£π‘š

( π΄π‘šπ‘’βˆ’12(π’ƒπ‘›βˆ’ππ‘š)Ξ£π‘š

βˆ’1(π’ƒπ‘›βˆ’ππ‘š)

𝑀

π‘š=1

βˆ’ 𝛼𝑛)2

𝑁

𝑛=1

Ξ£π‘š =

𝜎π‘₯2 βˆ— βˆ— βˆ—

βˆ— πœŽπ‘¦2 βˆ— βˆ—

βˆ— βˆ— πœŽπ‘€π‘2 βˆ—

βˆ— βˆ— βˆ— πœŽβ„Žπ‘2

,
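A minimal sketch of fitting this sum-of-Gaussians error model by nonlinear least squares, restricted to diagonal covariances for brevity (the starred off-diagonal entries above are not fitted here). The synthetic data and the choice of M are placeholders; the point is only the least-squares form of the objective, not the authors' training pipeline.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_gaussian_sum(B, alpha, M=2):
    """Fit sum_m A_m exp(-0.5 (b - mu_m)^T Sigma_m^{-1} (b - mu_m)) to targets alpha.

    B     : (N, D) observation vectors b_n
    alpha : (N,) target values alpha_n
    Only diagonal Sigma_m are fitted in this sketch.
    """
    N, D = B.shape

    def unpack(theta):
        A = theta[:M]
        mu = theta[M:M + M * D].reshape(M, D)
        log_var = theta[M + M * D:].reshape(M, D)   # log-variances keep Sigma positive
        return A, mu, np.exp(log_var)

    def model(theta):
        A, mu, var = unpack(theta)
        out = np.zeros(N)
        for m in range(M):
            diff = B - mu[m]
            out += A[m] * np.exp(-0.5 * np.sum(diff**2 / var[m], axis=1))
        return out

    theta0 = np.concatenate([np.ones(M),
                             np.tile(B.mean(0), M) + 0.1 * np.random.randn(M * D),
                             np.zeros(M * D)])
    res = least_squares(lambda th: model(th) - alpha, theta0)
    return unpack(res.x)

# toy usage: 4D observations (x, y, w_b, h_b) with synthetic targets
B = np.random.randn(200, 4)
alpha = np.exp(-0.5 * (B**2).sum(axis=1))
A, mu, var = fit_gaussian_sum(B, alpha, M=1)
```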

[Fusion diagram: each cue's per-frame estimates and model-based accuracies a are mapped by a trained error model to observation variances u — Dense Stereo: (h, a_{s,h}), (n_1, a_{s,n_1}), (n_3, a_{s,n_3}) → Trained Model 1 → (h, u_{s,h}), (n_1, u_{s,n_1}), (n_3, u_{s,n_3}); 3D Points: (h, a_{p,h}) → Trained Model 2 → (h, u_{p,h}); Object Detection: (n_3, a_{d,n_3}) → trained model → (n_3, u_{d,n_3}).]

π‘Žπ‘‘π‘˜ =πœŽπ‘¦πœŽβ„Žπ‘πœŽπ‘¦ + πœŽβ„Žπ‘

(a) Histogram of data. (b) Learned linear model to relate the observation variance to the underlying variable, $a_d^k$.
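A minimal sketch of how the per-frame observation variances u can drive the fusion: a scalar Kalman-style update of the ground height in which each cue contributes a measurement weighted by the variance its trained model predicts for that frame. This only illustrates the adaptive-covariance idea, reduced to the height h; the full system fuses the plane orientation as well.

```python
import numpy as np

def kalman_update(x, P, z, u):
    """Scalar Kalman measurement update: state x with variance P, measurement z with variance u."""
    K = P / (P + u)            # Kalman gain: low u (confident cue) -> large correction
    return x + K * (z - x), (1.0 - K) * P

def fuse_ground_height(h_prior, P_prior, cues):
    """Fuse per-frame ground height estimates from several cues.

    cues : list of (h_cue, u_cue) pairs, where u_cue is the observation variance
           predicted by that cue's trained error model for this frame.
    """
    h, P = h_prior, P_prior
    for z, u in cues:
        h, P = kalman_update(h, P, z, u)
    return h, P

# toy usage: the stereo cue is confident this frame, the 3D-points cue is not
print(fuse_ground_height(1.70, 0.05, [(1.65, 0.01), (1.90, 0.5)]))
```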

Top Monocular Performance Summary

Challenge: Scale Ambiguity

[Figure: camera center, image plane, 2D image and ambiguous 3D points — a single image is consistent with 3D scenes at any scale.]

[KITTI plot legend: Stereo Systems, Our Monocular System, Prior Monocular System.]

Fig. Height error relative to ground truth. The effectiveness of our data fusion is shown by the smoother (less spiky) filter output and the far lower error.

Fig. (a) Average error in ground plane estimation. (b) Percentage of frames where the height error is below 7%.

Ground Height

Object Localization

Comparison of 3D object localization errors for calibrated ground, stereo cue only, fixed covariance fusion and adaptive covariance fusion of stereo and detection cues.

Adaptive Fusion
