Towards a Comprehensive Super-Pixel Representation of Traffic Scenes
Uwe Franke et al., Daimler R&D



Autonomous Vehicles: it’s all about Sensors

2007 2017


What’s a good Representation of the Scene?

A good representation should be

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

On which level should sensor fusion act?

1. Object level (boxes)?

2. Low level or even raw data?

3. Some intermediate level?


The Stixel-Representation

D. Pfeiffer and U. Franke: “Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011

500,000 3D points -> 500…1000 Stixels

S. Gehrig, F. Eberli, T. Meyer: “A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching”, ICVS 2009 (Best Paper Award)


The first Attempt: Stixel 1.0

[Figure scale: far ... close]

1. Compute Disparities

2. Compute Freespace

3. Compute Stixel Height

4. Refine Stixel Distance

H.Badino, U.Franke, and D.Pfeiffer: “The Stixel World – A Compact Medium Level Representation of the 3D-World”, DAGM Symposium 2009
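As a rough illustration of why the result is compact, one Stixel can be stored as a handful of numbers per image column. The sketch below is a toy under stated assumptions, not the authors' data structure: the field names, `focal_px`, and `baseline_m` are hypothetical, and the only formula used is the standard rectified-stereo relation Z = f · B / d.

```python
from dataclasses import dataclass


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Standard rectified-stereo relation Z = f * B / d (depth in metres)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


@dataclass
class Stixel:
    """Hypothetical record for one vertical segment of an image column."""
    column: int          # image column the Stixel belongs to
    top_row: int         # first image row covered by the Stixel
    bottom_row: int      # last image row covered by the Stixel
    disparity_px: float  # one disparity value for the whole segment

    def depth_m(self, focal_px: float, baseline_m: float) -> float:
        """Distance of the segment, derived from its single disparity."""
        return disparity_to_depth(self.disparity_px, focal_px, baseline_m)


# Made-up camera parameters: 1000 px focal length, 0.2 m baseline.
s = Stixel(column=120, top_row=200, bottom_row=380, disparity_px=10.0)
print(s.depth_m(focal_px=1000.0, baseline_m=0.2))
```

Storing one such record instead of every pixel in the segment is what turns hundreds of thousands of 3D points into a few hundred Stixels.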


Stixel anno 2009


Stixel-based Focus of Attention

The Stixel World significantly reduces the computational burden of classification schemes while also reducing the false positive rate.

[Diagram labels: 5,000 hypotheses vs. 500 hypotheses; HOG; SGM stereo; Stixel; Classifier; Class 1; Class 2]

5x fewer false positives

Another 8x fewer false positives

500 car hypotheses centered at the Stixels

M. Enzweiler, M. Hummel, and U. Franke: “Efficient Stixel-Based Object Recognition”, IEEE Intelligent Vehicles Symposium IV 2012


Stixel 1.0: Mission accomplished?

Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Only the closest objects are represented

Integrated free-space computation takes much time

Stixels only encode geometry

Strong regularization reduces disparity noise significantly.

BUT: close objects may hide relevant parts of the scene.


Stixel Segmentation as an Optimization Problem


D. Pfeiffer and U. Franke: “Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011

D. Gallup, M. Pollefeys, and J.-M. Frahm: “3D Reconstruction using an n-Layer Heightmap”, DAGM 2010


How do we expect the scene to be arranged?

- Large objects are preferred

- Changes between the label types: P(ground → object) ≠ P(object → ground)

- In general, objects are earthbound

- Usually, just a small number of Stixels per column
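To make the optimization view concrete, here is a minimal sketch of segmenting a single image column by dynamic programming. It is a deliberately simplified stand-in, not the method from the cited papers: each segment is scored by its squared deviation from a constant disparity, and one hypothetical `segment_penalty` per segment plays the role of the priors favoring few, large Stixels. Label types (ground, object) and the asymmetric transition probabilities are left out.

```python
from math import inf
from typing import List, Tuple


def segment_column(disparity: List[float],
                   segment_penalty: float = 1.0) -> List[Tuple[int, int, float]]:
    """Split one column into segments (first_row, last_row, mean_disparity).

    Minimizes: sum of squared deviations from each segment mean
               + segment_penalty per segment.
    """
    n = len(disparity)
    # Prefix sums give every segment cost in O(1).
    s1, s2 = [0.0], [0.0]
    for d in disparity:
        s1.append(s1[-1] + d)
        s2.append(s2[-1] + d * d)

    def seg_cost(i: int, j: int) -> float:
        """Squared error of rows i..j-1 around their mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] + [inf] * n   # best[j]: optimal cost of rows 0..j-1
    prev = [0] * (n + 1)       # prev[j]: start row of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + seg_cost(i, j) + segment_penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through the stored cut positions.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j - 1, (s1[j] - s1[i]) / (j - i)))
        j = i
    segments.reverse()
    return segments


column = [10.0, 10.0, 10.0, 10.0, 2.0, 2.0, 2.0, 2.0]
print(segment_column(column))
```

The double loop visits every (start, end) pair of rows, so the cost per column grows quadratically with the image height.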


Robustness: Challenge Darkness


Robustness: Challenge Strong Rain


Robustness: Challenge Reflections


Stixel Motion and Segmentation

D. Pfeiffer and U. Franke: “Efficient Representation of Traffic Scenes by Means of Dynamic Stixels”, IEEE Intelligent Vehicles Symposium 2010 (Best Paper Award)

F. Erbs, B. Schwarz, and U. Franke: “From Stixels to Objects - A Conditional Random Field based Approach”, IEEE Intelligent Vehicles Symposium 2013

Stixels are optimally grouped based on their depth and motion.

GraphCut execution time: 1 ms on a single core of an Intel i7.


Application 1: May I Enter the Round-About?

M. Muffert, T. Milbich, D. Pfeiffer, and U. Franke: “May I Enter the Roundabout? A Time-To-Contact Computation based on Stereo Vision”, IEEE Intelligent Vehicles Symposium IV 2012 (Best Paper Award)

[Figure labels: -180°, -90°, -20°, 0°, +180°]

2 Objects

~ 560,000 Pixels

~ 600 Stixels


Stixel 2.0: Mission accomplished?

Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Stixels only encode geometry and motion

Problems with non-planar roads

Strong regularization reduces disparity noise significantly.

BUT: still too many ghost objects per hour of driving.


There is more than Depth

So far, Stixels are estimated based on depth information only.

Some work has already addressed this, e.g. with online color modeling [1]:

Images taken from Sanberg et al. [1]

[1] W. P. Sanberg, G. Dubbelman, and P. H. de With: “Extending the Stixel World with Online Self-supervised Color Modeling for Road-versus-Obstacle Segmentation”, ITSC 2014


There is more than Depth

T. Scharwächter and U. Franke: “Low-Level Fusion of Color, Texture and Depth for Robust Road Scene Understanding”, IEEE Intelligent Vehicles Symposium 2015


Feature Channels

Fast-to-compute and complementary feature transformations:

- rg chromaticity: $r = \frac{R}{R + G + B}$, $g = \frac{G}{R + G + B}$, $b = 1 - r - g$

- filter bank

- 3D height above the ground plane

- Vertical disparity gradient
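The rg chromaticity channel is simple enough to write out directly. A minimal per-pixel sketch follows; the function name is illustrative, and the handling of pure black pixels (where R + G + B = 0) is an assumption, since the slide does not cover that case.

```python
from typing import Tuple


def rg_chromaticity(R: float, G: float, B: float) -> Tuple[float, float, float]:
    """Return (r, g, b) with r = R/(R+G+B), g = G/(R+G+B), b = 1 - r - g."""
    total = R + G + B
    if total == 0:
        # Assumption: treat black as neutral to avoid division by zero.
        return (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    r = R / total
    g = G / total
    return (r, g, 1.0 - r - g)


# The same color at two brightness levels maps to the same chromaticity.
print(rg_chromaticity(50, 100, 50))
print(rg_chromaticity(100, 200, 100))
```

Because the intensity is divided out, the channel responds to color rather than brightness, which is one reason it complements the depth-based channels.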


Pixel-level Results


Ground Obstacle Vegetation Sky


Stixel Extension Results

Ground Obstacle Vegetation Sky Grass


The Cityscapes Dataset

50 major German Cities

5000 precisely labeled frames

>4000 downloads

www.cityscapes-dataset.com

M. Cordts et al.: “The Cityscapes Dataset for Semantic Urban Scene Understanding”, CVPR 2016


The Cityscapes Benchmark


Benchmark Challenges

• pixel-level semantic labeling

• instance-level semantic labeling

Properties

• evaluation server

• non-public test set

• prevent overfitting

• public evaluation scripts

• ranking website

• initial set of baselines


The Cityscapes Benchmark

Adelaide Context [Lin et al. ‘16]

Name                 Reference                  Classes         Categories      sub   runtime
                                                IoU     iIoU    IoU     iIoU
ResNet-38            Wu et al., 2016            80.6    57.8    91.0    79.1
PSPNet               Zhao et al., 2017          80.2    58.1    90.6    78.2
TuSimple             Wang et al., 2017          80.1    56.9    90.7    77.8
RefineNet            Lin et al., 2017           73.6    47.2    87.9    70.6
LRR-4x               Ghiasi and Fowlkes, 2016   71.8    47.9    88.4    73.9
FRRN                 Pohlen et al., 2017        71.8    45.5    88.9    75.1    2
Adelaide Context     Lin et al., 2016           71.6    51.7    87.3    74.1
Deep Layer Cascade   Li et al., 2017            71.1    47.0    88.1    74.1
DeepLab v2 CRF       Chen et al., 2016          70.4    42.6    86.4    67.7
Dilation 10          Yu and Koltun, 2016        67.1    42.0    86.5    71.1          4 s
Scale invariant CNN  Kreso et al., 2016         66.3    44.9    85.0    71.2
SQ                   Treml et al., 2016         59.8    32.3    84.3    66.0          60 ms
ENet                 Paszke et al., 2016        58.3    34.4    80.4    64.0    2     13 ms

Other columns on the slide: fine tuned, depth, coarse, mst, CRF (entries not recoverable)
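The IoU columns follow the usual intersection-over-union definition, TP / (TP + FP + FN), accumulated per class and then averaged. A minimal sketch on flat label lists is given below; the function names are illustrative, integer class IDs are an assumption, and the instance-weighted iIoU variant is not covered.

```python
from typing import Iterable, Sequence


def class_iou(pred: Sequence[int], truth: Sequence[int], cls: int) -> float:
    """Intersection-over-union of one class: TP / (TP + FP + FN)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    denom = tp + fp + fn
    # A class absent from both prediction and ground truth has no defined IoU.
    return tp / denom if denom else float("nan")


def mean_iou(pred: Sequence[int], truth: Sequence[int], classes: Iterable[int]) -> float:
    """Average of the per-class IoU scores."""
    scores = [class_iou(pred, truth, c) for c in classes]
    return sum(scores) / len(scores)


pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
print(class_iou(pred, truth, 0), class_iou(pred, truth, 1), mean_iou(pred, truth, [0, 1]))
```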


Deep Learning in Fully Convolutional Neural Networks (2015)

Each layer computes one of a few simple mathematical operations that can be highly parallelized (GPUs).

low-level features: first layers typically learn simple edge-detection and color filters

mid-level features: typically detect simple shapes like corners, circles, patterns, …

high-level features: complex shapes, parts of larger objects (wheels), …

J. Long, E. Shelhamer, and T. Darrell: “Fully convolutional networks for semantic segmentation”, CVPR 2015
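To illustrate what "a simple mathematical operation" in an early layer looks like, the sketch below applies one hand-set vertical-edge filter followed by a ReLU. This is only an illustration: in a trained network the filter weights are learned rather than fixed by hand, and the function names are made up for this example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image without padding (cross-correlation,
    which is what CNN frameworks usually call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in feature_map]


# Dark-to-bright vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1] for _ in range(3)]
# Hand-set vertical-edge filter (Prewitt-like); a network would learn this.
edge_kernel = [[-1, 0, 1] for _ in range(3)]

print(relu(conv2d_valid(image, edge_kernel)))
```

Every output value depends only on a small image patch, so all positions can be computed independently, which is why such layers map well to GPUs.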


Getting closer to Human Perception


Real-Time Scene Labeling


FCNs at Rainy Nights


Stixel and Semantics

Semantic Labeling: 2 million labeled points

Stereo Matching: 2 million 3D points

3D Stixel Representation: 1,000 Stixels

Semantic Stixels: 1,000 Stixels

L. Schneider et al.: “Semantic Stixel: Depth is not Enough”, IEEE Intelligent Vehicles Symposium 2016 (Best Paper Award)


Semantic Stixels: Results

Columns: image | semantic input | semantic representation | depth input | depth representation


Semantic Stixel World in Downtown Stuttgart


Semantic Stixel in 3D


Lost! … and Found by CNN


Even small objects on the road can cause damage to the car and must be avoided by all means.

Humans are brilliant at detecting those objects.


False Positive Rates for Lost Cargo Fusion

Panels: CNN-based Lost Cargo Detection | Stereo-based Lost Cargo Detection | Lost Cargo Fusion

The false positive detections (above) disappeared.


Stixel 3.0: Mission accomplished?

Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Reconstruction error is high in hilly areas due to simple model.

Semantics seems to solve all problems with ghost objects, but the geometric model fights against semantics too much in SFO.


Slanted Stixels: Solving the SFO-Problem


Original Stixels

Slanted Stixels

New model to represent all classes:

$\mu(s_i, v) = b_i \cdot v + a_i$

With priors according to the class:

$E_{plane}(s_i) = \left(\frac{a_i - \mu^{a}_{c_i}}{\sigma^{a}_{c_i}}\right)^2 + \left(\frac{b_i - \mu^{b}_{c_i}}{\sigma^{b}_{c_i}}\right)^2 - \log Z$

Optimized jointly within Semantic Stixel probabilistic framework

D. Hernandez et al.: “Slanted Stixels: Representing San Francisco's Steepest Street”, BMVC 2017 (Best Paper Award)
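The slanted model and its class prior on this slide can be evaluated in a few lines. The sketch below is a direct transcription under assumptions: the parameter names are hypothetical, `log_z` is passed in as a plain number, and the per-class means and standard deviations are made-up values rather than ones from the paper.

```python
def expected_disparity(a: float, b: float, v: float) -> float:
    """Slanted model: disparity varies linearly with image row v,
    mu(s_i, v) = b_i * v + a_i."""
    return b * v + a


def plane_prior_energy(a: float, b: float,
                       mu_a: float, sigma_a: float,
                       mu_b: float, sigma_b: float,
                       log_z: float = 0.0) -> float:
    """Class prior on the plane parameters:
    ((a - mu_a) / sigma_a)^2 + ((b - mu_b) / sigma_b)^2 - log Z."""
    return ((a - mu_a) / sigma_a) ** 2 + ((b - mu_b) / sigma_b) ** 2 - log_z


# With b = 0 the disparity is constant over the segment.
print(expected_disparity(a=12.0, b=0.0, v=300))
# A non-zero slope lets the disparity change from row to row.
print(expected_disparity(a=2.0, b=0.5, v=4))
# Parameters far from the class mean are penalized quadratically.
print(plane_prior_energy(a=1.0, b=0.0, mu_a=0.0, sigma_a=0.5, mu_b=0.0, sigma_b=1.0))
```

The prior pulls each segment toward the slope expected for its class while still allowing a tilted road when the data demand it.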


New Dataset: SYNTHIA-San Francisco


• Generated with SYNTHIA toolkit to evaluate our algorithm, features slanted roads

• Photorealistic virtual sequence (2224 images), pixel-level depth and semantic ground truth

• Expensive to generate equivalent real-data sequence

• Will be available on Synthia soon: http://www.synthia-dataset.net


Results: Frame-Rate

• Our version is slightly slower because of the increased complexity

Stixel run time measured on a 6-core Intel i7

Metric            Dataset      Stixel 3.0   Slanted S.
Disp Err (%)      Ladicky      17.3         16.9
                  KITTI 15     10.9         11.0
                  SYNTHIA-SF   30.9         12.9
IoU (%)           Ladicky      63.5         63.4
                  Cityscapes   65.7         65.8
                  SYNTHIA-SF   46.0         48.5
Frame-rate (Hz)   KITTI 15     113          61
                  Cityscapes   20.9         6.6
                  SYNTHIA-SF   19.4         4.7

Frame-rate: higher is better


Stixel Computation Complexity: Pre-Segmentation

• Infer possible Stixel cuts (pre-segmentation) from image

• Avoid checking all possible Stixel combinations

• If given the correct Stixel cuts, same accuracy (or better!)

Inputs: Semantic Segmentation, Disparity Image

Pre-Segmentation: ( h )

Dynamic Programming: ( h’ x h’ ), h’ << h

Labels: Ground, Object, Sky
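A small sketch of why restricting the cut positions helps. It assumes, purely for illustration, that candidate cuts are placed wherever a per-row label changes; how the cuts are actually inferred is not specified on the slide. The count of segment hypotheses is the plain number of (start, end) pairs a column-wise dynamic program would consider.

```python
from typing import List, Sequence


def candidate_cuts(labels: Sequence[str]) -> List[int]:
    """Rows where the label differs from the row above (assumed cut rule)."""
    return [v for v in range(1, len(labels)) if labels[v] != labels[v - 1]]


def num_segment_hypotheses(num_units: int) -> int:
    """Number of (start, end) segments over num_units atomic pieces."""
    return num_units * (num_units + 1) // 2


# Toy column of h = 12 rows with three label regions.
column_labels = ["sky"] * 3 + ["object"] * 4 + ["ground"] * 5
cuts = candidate_cuts(column_labels)

h = len(column_labels)       # every row is a possible cut
h_reduced = len(cuts) + 1    # pieces left after pre-segmentation (h')

print(cuts)
print(num_segment_hypotheses(h), num_segment_hypotheses(h_reduced))
```

One linear pass over the column finds the cuts, after which the quadratic step only runs over the much smaller set of pieces.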


Pre-Segmentation Results: Frame-rate

• Pre-segmentation speeds up both original and Slanted Stixels

                               Without pre-segmentation    With pre-segmentation
Metric            Dataset      Stixel 3.0   Slanted S.     Stixel 3.0   Slanted S.
Disp Err (%)      Ladicky      17.3         16.9           18.5         17.8
                  KITTI 15     10.9         11.0           11.8         11.7
                  SYNTHIA-SF   30.9         12.9           33.9         15.4
IoU (%)           Ladicky      63.5         63.4           63.9         63.7
                  Cityscapes   65.7         65.8           65.7         65.8
                  SYNTHIA-SF   46.0         48.5           46.9         48.5
Frame-rate (Hz)   KITTI 15     113          61             120          116
                  Cityscapes   20.9         6.6            36.6         27.5
                  SYNTHIA-SF   19.4         4.7            38.9         33.1

Frame-rate: higher is better
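The speed-up from pre-segmentation can be read off the frame-rate rows by dividing the rate with pre-segmentation by the rate without it. The short script below does this arithmetic using only the values from the table; the dictionary layout is just a convenient way to hold them.

```python
# (without pre-segmentation, with pre-segmentation) frame rates in Hz.
frame_rate_hz = {
    ("KITTI 15", "Stixel 3.0"): (113.0, 120.0),
    ("KITTI 15", "Slanted S."): (61.0, 116.0),
    ("Cityscapes", "Stixel 3.0"): (20.9, 36.6),
    ("Cityscapes", "Slanted S."): (6.6, 27.5),
    ("SYNTHIA-SF", "Stixel 3.0"): (19.4, 38.9),
    ("SYNTHIA-SF", "Slanted S."): (4.7, 33.1),
}


def speedup(before_hz: float, after_hz: float) -> float:
    """Ratio of frame rates; values above 1 mean faster."""
    return after_hz / before_hz


for (dataset, method), (plain, preseg) in frame_rate_hz.items():
    print(f"{dataset:11s} {method:11s} {speedup(plain, preseg):.1f}x")
```

The gain is largest for Slanted Stixels on the higher-resolution datasets, roughly 4x on Cityscapes and 7x on SYNTHIA-SF, against under 2x on KITTI 15.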


Visual Examples

Left Image Original Stixels Slanted Stixels


Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust