
Towards a Comprehensive Super-Pixel Representation of Traffic Scenes

Uwe Franke et al., Daimler R&D

Autonomous Vehicles: it’s all about Sensors

2007 2017


What’s a good Representation of the Scene?

A good representation

should be

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

On which level should sensor fusion act?

1. Object level (boxes)?

2. Low level or even raw data?

3. Some intermediate level?


The Stixel-Representation

D. Pfeiffer and U. Franke: „Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011

500,000 3D points -> 500…1000 Stixels


S. Gehrig, F. Eberli, and T. Meyer: “A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching”, ICVS 2009 (Best Paper Award)

The first Attempt: Stixel 1.0

Color scale: far ... close

1. Compute Disparities

2. Compute Freespace

3. Compute Stixel Height

4. Refine Stixel Distance

H. Badino, U. Franke, and D. Pfeiffer: “The Stixel World – A Compact Medium Level Representation of the 3D-World”, DAGM Symposium 2009
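The four steps can be illustrated for a single image column. The following is only a minimal sketch, not the method of the cited paper: it assumes the disparities of step 1 are already available (e.g. from SGM), that the road is flat with a known expected ground disparity per row, and it replaces the original optimization with simple thresholding. The function name `column_stixel` and the parameters `ground_disp` and `tol` are hypothetical.

```python
import numpy as np

def column_stixel(disp_col, ground_disp, tol=0.5):
    """Extract at most one stixel from one disparity column (row 0 = image top).

    disp_col:    measured disparities of this column, shape (h,)
    ground_disp: assumed disparity of a flat road at each row, shape (h,)
    Returns (base_row, top_row, disparity) or None if the column is free.
    """
    disp_col = np.asarray(disp_col, dtype=float)
    ground_disp = np.asarray(ground_disp, dtype=float)
    h = len(disp_col)

    # Step 2 (free space): scan upward from the bottom until the measured
    # disparity is clearly larger than the road disparity.
    base = None
    for v in range(h - 1, -1, -1):
        if disp_col[v] - ground_disp[v] > tol:
            base = v
            break
    if base is None:
        return None

    # Step 3 (height): extend upward while the disparity stays close to
    # the disparity found at the base.
    top = base
    while top > 0 and abs(disp_col[top - 1] - disp_col[base]) <= tol:
        top -= 1

    # Step 4 (distance): refine with a robust estimate over the whole stixel.
    disparity = float(np.median(disp_col[top:base + 1]))
    return base, top, disparity

# Toy column: flat road, an obstacle at disparity 5, far background on top.
ground = np.arange(10, dtype=float)
column = ground.copy()
column[2:5] = 5.0
column[:2] = 0.0
stixel = column_stixel(column, ground)
```

Because only the first obstacle from the bottom is found, this sketch shares the limitation of representing only the closest object per column.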


Stixel anno 2009


Stixel-based Focus of Attention

The Stixel-World significantly reduces the computational burden of classification schemes while also reducing the false positive rate.

Figure labels: 5,000 hypotheses; 500 hypotheses; HOG; Class 1; Class 2; SGM stereo; classifier; Stixel; 5x fewer false positives; another 8x fewer false positives

500 car hypotheses centered at the Stixels

M. Enzweiler, M. Hummel, and U. Franke: „Efficient Stixel-Based Object Recognition“, IEEE Intelligent Vehicles Symposium IV 2012

Stixel 1.0: Mission accomplished?

Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Only the closest objects are represented

Integrated free-space computation takes much time

Stixels only encode geometry

Strong regularization reduces disparity noise significantly.

BUT: close objects may hide relevant parts of the scene.


Stixel Segmentation as an Optimization Problem


D. Pfeiffer and U. Franke: „Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data”, BMVC 2011

D. Gallup, M. Pollefeys, and J.-M. Frahm: “3D Reconstruction using an n-Layer Heightmap”, DAGM 2010

How do we expect the scene to be arranged?

- Large objects are preferred

- Changes between the label types are not symmetric: P(ground → object) ≠ P(object → ground)

- In general, objects are earthbound

- Usually, just a small number of Stixels per column
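These expectations can be turned into costs and minimized per column with dynamic programming. The sketch below is a simplified stand-in for the multi-layer Stixel optimization: it labels each row as ground or object with a Viterbi-style recursion, and all cost values are made-up assumptions. The asymmetric `trans_cost` matrix encodes P(ground → object) ≠ P(object → ground), and any nonzero transition cost favors few, large segments.

```python
import numpy as np

GROUND, OBJECT = 0, 1

def segment_column(data_cost, trans_cost):
    """Label every row of one column with minimal total cost.

    data_cost:  shape (h, 2), cost of each label per row, rows ordered
                from the bottom of the image upward
    trans_cost: shape (2, 2), trans_cost[a, b] is the cost of label a
                directly below label b (need not be symmetric)
    """
    data_cost = np.asarray(data_cost, dtype=float)
    trans_cost = np.asarray(trans_cost, dtype=float)
    h = data_cost.shape[0]
    best = np.zeros((h, 2))            # best[v, b]: cheapest path ending in b at row v
    back = np.zeros((h, 2), dtype=int) # label of row v-1 on that path
    best[0] = data_cost[0]
    for v in range(1, h):
        for b in (GROUND, OBJECT):
            cand = best[v - 1] + trans_cost[:, b]
            back[v, b] = int(np.argmin(cand))
            best[v, b] = cand[back[v, b]] + data_cost[v, b]
    # Backtrack from the cheapest label in the top row.
    labels = np.zeros(h, dtype=int)
    labels[-1] = int(np.argmin(best[-1]))
    for v in range(h - 1, 0, -1):
        labels[v - 1] = back[v, labels[v]]
    return labels

# Hypothetical costs: row 1 is noisy and slightly prefers "object".
data = [[0.0, 1.0], [0.6, 0.4], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
trans = [[0.0, 1.0],   # ground below ground / ground below object
         [3.0, 0.0]]   # object below ground is expensive (objects are earthbound)
labels = segment_column(data, trans)
```

In this toy column the noisy row is absorbed into the ground segment because switching labels twice costs more than the small data-term gain.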

Robustness: Challenge Darkness


Robustness: Challenge Strong Rain


Robustness: Challenge Reflections


Stixel Motion and Segmentation

D. Pfeiffer and U. Franke: „Efficient Representation of Traffic Scenes by Means of Dynamic Stixels”, IEEE Intelligent Vehicles Symposium 2010 (Best Paper Award)

F. Erbs, B. Schwarz, and U. Franke: „From Stixels to Objects - A Conditional Random Field based Approach“, IEEE Intelligent Vehicles Symposium 2013

Stixels are optimally grouped based on their depth & motion.

Graph-cut execution time: 1 ms on a single core of an Intel i7.


Application 1: May I Enter the Round-About?

M. Muffert, T. Milbich, D. Pfeiffer, and U. Franke: “May I Enter the Roundabout? A Time-To-Contact Computation based on Stereo Vision”, IEEE Intelligent Vehicles Symposium IV 2012 (Best Paper Award)

Figure labels: -90°, -20°; axis: -180°, 0°, +180°

2 Objects

~560,000 Pixels

~600 Stixels




Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Stixels only encode geometry and motion

Problems with non-planar roads

Strong regularization reduces disparity noise significantly.

BUT: still too many ghost objects per hour of driving.


There is more than Depth

So far, Stixels are estimated based on depth information only.

Some work has already addressed this, e.g. with online color modeling [1]:


Images taken from Sanberg et al. [1]

[1] W. P. Sanberg, G. Dubbelman, and P. H. de With: “Extending the Stixel World with Online Self-supervised Color Modeling for Road-versus-Obstacle Segmentation”, ITSC 2014

There is more than Depth


T. Scharwächter and U. Franke: “Low-Level Fusion of Color, Texture and Depth for Robust Road Scene Understanding”, IEEE Intelligent Vehicles Symposium 2015

Feature Channels

Fast to compute and complementary feature transformations:

• rg chromaticity: $r = \frac{R}{R + G + B}$, $g = \frac{G}{R + G + B}$, $b = 1 - r - g$

• filter bank

• 3D height above the ground plane

• Vertical disparity gradient
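The rg chromaticity channel normalizes each pixel by its overall intensity, so it is largely invariant to brightness changes. A minimal NumPy sketch follows; the function name `rg_chromaticity` and the guard value `eps` for black pixels are assumptions for illustration.

```python
import numpy as np

def rg_chromaticity(rgb, eps=1e-6):
    """Convert an (..., 3) RGB array to (r, g, b) chromaticity coordinates.

    r = R / (R + G + B), g = G / (R + G + B), b = 1 - r - g
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    total = np.maximum(total, eps)  # assumed guard against division by zero
    r = rgb[..., 0:1] / total
    g = rgb[..., 1:2] / total
    b = 1.0 - r - g
    return np.concatenate([r, g, b], axis=-1)

# The same color at two brightness levels maps to the same chromaticity.
dark = rg_chromaticity([100, 50, 50])
bright = rg_chromaticity([200, 100, 100])
```

Only r and g need to be stored as features, since b is fully determined by them.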


Pixel-level Results


Ground Obstacle Vegetation Sky

Stixel Extension Results

Ground Obstacle Vegetation Sky Grass


The Cityscapes Dataset

50 major German Cities

5000 precisely labeled frames

>4000 downloads

www.cityscapes-dataset.com


M. Cordts et al.: “The Cityscapes Dataset for Semantic Urban Scene Understanding”, CVPR 2016

The Cityscapes Benchmark


Benchmark Challenges

• pixel-level semantic labeling

• instance-level semantic labeling

Properties

• evaluation server

• non-public test set

• prevent overfitting

• public evaluation scripts

• ranking website

• initial set of baselines

The Cityscapes Benchmark

Method flag columns on the slide: fine tuned, depth, coarse, mst, CRF, sub, runtime

Name                 Reference                  Classes        Categories     sub  runtime
                                                IoU    iIoU    IoU    iIoU
ResNet-38            Wu et al., 2016            80.6   57.8    91.0   79.1
PSPNet               Zhao et al., 2017          80.2   58.1    90.6   78.2
TuSimple             Wang et al., 2017          80.1   56.9    90.7   77.8
RefineNet            Lin et al., 2017           73.6   47.2    87.9   70.6
LRR-4x               Ghiasi and Fowlkes, 2016   71.8   47.9    88.4   73.9
FRRN                 Pohlen et al., 2017        71.8   45.5    88.9   75.1   2
Adelaide Context     Lin et al., 2016           71.6   51.7    87.3   74.1
Deep Layer Cascade   Li et al., 2017            71.1   47.0    88.1   74.1
DeepLab v2 CRF       Chen et al., 2016          70.4   42.6    86.4   67.7
Dilation 10          Yu and Koltun, 2016        67.1   42.0    86.5   71.1        4 s
Scale invariant CNN  Kreso et al., 2016         66.3   44.9    85.0   71.2
SQ                   Treml et al., 2016         59.8   32.3    84.3   66.0        60 ms
ENet                 Paszke et al., 2016        58.3   34.4    80.4   64.0   2    13 ms
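The IoU columns report the intersection-over-union per class, IoU = TP / (TP + FP + FN), averaged over classes. The sketch below computes it from a confusion matrix; it is an illustration, not the public evaluation script of the benchmark, and it does not cover the instance-weighted iIoU. The function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(gt, pred, num_classes):
    """Per-class IoU and its mean from ground-truth and predicted label maps."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    # Confusion matrix: rows = ground truth, columns = prediction.
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    # Classes that never occur are excluded from the mean (NaN).
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return iou, float(np.nanmean(iou))

# Four pixels, two classes: one pixel of class 0 is mislabeled as class 1.
per_class, mean = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here class 0 reaches 1/2 and class 1 reaches 2/3, which shows how a single error penalizes both the missed class and the over-predicted class.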


Deep Learning in Fully Convolutional Neural Networks (2015)

J. Long, E. Shelhamer, and T. Darrell: “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Each layer computes one of a few simple mathematical operations that can be highly parallelized (GPUs).

Low-level features: first layers typically learn simple edge-detection and color filters

Mid-level features: typically detect simple shapes like corners, circles, patterns, …

High-level features: complex shapes, parts of larger objects (wheels), …
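As a concrete instance of the simple operations a convolutional layer computes, the sketch below applies a hand-written Sobel kernel followed by a ReLU to a toy image. In a trained network such edge filters are learned rather than fixed; the kernel and the image here are illustrative assumptions, and the loop-based `conv2d_valid` is written for clarity, not speed.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D cross-correlation without padding (the 'convolution' of CNN layers)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Fixed Sobel kernel that responds to dark-to-bright vertical edges.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Toy image: dark left half, bright right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Convolution followed by ReLU, as in a typical first layer.
edges = np.maximum(conv2d_valid(image, sobel_x), 0.0)
```

The response is nonzero only in the output columns whose 3x3 window straddles the brightness step, which is the behavior expected of a low-level edge feature.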

Getting closer to Human Perception


Real-Time Scene Labeling


FCNs at Rainy Nights


Stixel and Semantics

Semantic Labeling: 2 million labeled points

Stereo Matching: 2 million 3D points

3D Stixel Representation: 1,000 Stixels

Semantic Stixel: 1,000 Stixels

L. Schneider et al.: “Semantic Stixel: Depth is not Enough”, IEEE Intelligent Vehicles Symposium 2016 (Best Paper Award)


Semantic Stixels: Results

image | semantic input | semantic representation | depth input | depth representation


Semantic Stixel World in Downtown Stuttgart


Semantic Stixel in 3D


Lost! … and Found by CNN


Even small objects on the road can cause damage to the car and must be avoided by all means.

Humans are brilliant at detecting those objects.

False Positive Rates for Lost Cargo Fusion

CNN-based Lost Cargo Detection | Stereo-based Lost Cargo Detection

Lost Cargo Fusion

The false positive detections (above) disappeared.


Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust

Reconstruction error is high in hilly areas due to the simple model.

Semantics seems to solve all problems with ghost objects, but the geometric model fights against semantics too much in SFO (San Francisco).


Slanted Stixels: Solving the SFO-Problem


Original Stixels

Slanted Stixels

New model to represent all classes: $\mu(s_i, v) = b_i \cdot v + a_i$

With priors according to the class: $E_{plane}(s_i) = \left(\frac{a_i - \mu^a_{c_i}}{\sigma^a_{c_i}}\right)^2 + \left(\frac{b_i - \mu^b_{c_i}}{\sigma^b_{c_i}}\right)^2 - \log Z$

Optimized jointly within Semantic Stixel probabilistic framework

D. Hernandez et al.: “Slanted Stixels: Representing San Francisco's Steepest Street”, BMVC 2017 (Best Paper Award)
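The slanted model describes the disparity inside a Stixel as a line over the image row v, with a class-dependent Gaussian prior on offset a and slope b. The sketch below fits the line by least squares and evaluates the prior energy separately; this is a simplification, since the slide states that the parameters are optimized jointly within the probabilistic framework. The function names, the prior values, and the default `log_z=0.0` are assumptions.

```python
import numpy as np

def fit_slanted(rows, disparities):
    """Least-squares fit of mu(v) = b * v + a over the rows of one Stixel."""
    b, a = np.polyfit(rows, disparities, deg=1)  # highest degree first
    return float(a), float(b)

def plane_energy(a, b, mu_a, sigma_a, mu_b, sigma_b, log_z=0.0):
    """Prior energy of a slanted Stixel for the class priors (mu, sigma)."""
    return ((a - mu_a) / sigma_a) ** 2 + ((b - mu_b) / sigma_b) ** 2 - log_z

# Toy Stixel on a sloped road: the disparity grows linearly with the row.
rows = np.arange(10, 20)
disparities = 0.5 * rows + 2.0
a, b = fit_slanted(rows, disparities)

# Hypothetical prior for a class expected to be fronto-parallel (slope 0).
energy = plane_energy(a, b, mu_a=2.0, sigma_a=1.0, mu_b=0.0, sigma_b=0.25)
```

With a slope of 0.5 and an assumed prior standard deviation of 0.25, the slope term alone contributes (0.5 / 0.25)^2 = 4 to the energy, so a class with a flat-slope prior is penalized for explaining a sloped surface.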

New Dataset: SYNTHIA-San Francisco


• Generated with SYNTHIA toolkit to evaluate our algorithm, features slanted roads

• Photorealistic virtual sequence (2224 images), pixel-level depth and semantic ground truth

• Expensive to generate equivalent real-data sequence

• Will be available on Synthia soon: http://www.synthia-dataset.net


Results: Frame-Rate

• Our version is slightly slower because of the increased complexity

Stixel timing measured on a 6-core Intel i7.

Metric                             Dataset      Stixel 3.0   Slanted S.
Disp Err (%)                       Ladicky      17.3         16.9
                                   KITTI 15     10.9         11.0
                                   SYNTHIA-SF   30.9         12.9
IoU (%)                            Ladicky      63.5         63.4
                                   Cityscapes   65.7         65.8
                                   SYNTHIA-SF   46.0         48.5
Frame-rate (Hz, higher is better)  KITTI 15     113          61
                                   Cityscapes   20.9         6.6
                                   SYNTHIA-SF   19.4         4.7

Stixel Computation Complexity: Pre-Segmentation

• Infer possible Stixel cuts (pre-segmentation) from image

• Avoid checking all possible Stixel combinations

• If given the correct Stixel cuts, same accuracy (or better!)

Pipeline: Semantic Segmentation + Disparity Image → Pre-Segmentation ( h ) → Dynamic Programming ( h’ x h’ ), h’ << h → Ground / Object / Sky
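The idea of pre-segmentation can be sketched for one column: a single pass over its h rows proposes candidate cuts, and the dynamic program then only considers those h' positions instead of every row. The rule used below (a cut wherever the semantic label changes or the disparity jumps) and the threshold `disp_jump` are assumptions for illustration, not necessarily the criterion used in the talk.

```python
import numpy as np

def candidate_cuts(labels_col, disp_col, disp_jump=2.0):
    """Return the rows at which a new Stixel may start in one column.

    labels_col: semantic label per row, shape (h,)
    disp_col:   disparity per row, shape (h,)
    A row v is a candidate cut if its label differs from row v - 1 or if
    the disparity changes by more than disp_jump between the two rows.
    """
    labels_col = np.asarray(labels_col)
    disp_col = np.asarray(disp_col, dtype=float)
    label_change = labels_col[1:] != labels_col[:-1]
    jump = np.abs(np.diff(disp_col)) > disp_jump
    return (np.flatnonzero(label_change | jump) + 1).tolist()

# Toy column from top to bottom: sky (2), two objects (1), ground (0).
labels = [2, 2, 1, 1, 1, 1, 0, 0]
disps = [0.0, 0.0, 4.0, 4.0, 10.0, 10.0, 10.5, 11.0]
cuts = candidate_cuts(labels, disps)
```

In this column of h = 8 rows only h' = 3 cut positions remain. Note that the cut at row 4 separates two objects with the same label and is found only through the disparity jump.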

Pre-Segmentation Results: Frame-rate

• Pre-segmentation speeds up both original and Slanted Stixels

                                                Without pre-segmentation   With pre-segmentation
Metric                             Dataset      Stixel 3.0   Slanted S.    Stixel 3.0   Slanted S.
Disp Err (%)                       Ladicky      17.3         16.9          18.5         17.8
                                   KITTI 15     10.9         11.0          11.8         11.7
                                   SYNTHIA-SF   30.9         12.9          33.9         15.4
IoU (%)                            Ladicky      63.5         63.4          63.9         63.7
                                   Cityscapes   65.7         65.8          65.7         65.8
                                   SYNTHIA-SF   46.0         48.5          46.9         48.5
Frame-rate (Hz, higher is better)  KITTI 15     113          61            120          116
                                   Cityscapes   20.9         6.6           36.6         27.5
                                   SYNTHIA-SF   19.4         4.7           38.9         33.1
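The speed-up from pre-segmentation follows directly from the frame-rate rows of the table. The short script below divides the frame rate with pre-segmentation by the one without; the numbers are copied from the table, while the variable names are only illustrative.

```python
# Frame rates in Hz as (without pre-segmentation, with pre-segmentation).
frame_rate_hz = {
    "KITTI 15":   {"Stixel 3.0": (113.0, 120.0), "Slanted S.": (61.0, 116.0)},
    "Cityscapes": {"Stixel 3.0": (20.9, 36.6),   "Slanted S.": (6.6, 27.5)},
    "SYNTHIA-SF": {"Stixel 3.0": (19.4, 38.9),   "Slanted S.": (4.7, 33.1)},
}

# Speed-up factor = frame rate with pre-segmentation / frame rate without.
speedup = {
    dataset: {method: with_pre / without_pre
              for method, (without_pre, with_pre) in methods.items()}
    for dataset, methods in frame_rate_hz.items()
}

for dataset, methods in speedup.items():
    for method, factor in methods.items():
        print(f"{dataset:10s} {method:10s} {factor:.2f}x")
```

Slanted Stixels benefit most: roughly 4x on Cityscapes and 7x on SYNTHIA-SF, which brings their frame rate close to that of the original Stixels.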

Visual Examples

Left Image | Original Stixels | Slanted Stixels

Goals

1. Compact

2. Complete

3. Efficient

4. Explicit

5. Accurate

6. Robust
