Scene Understanding with 3D Deep Networks
3ddl.cs.princeton.edu/2016/slides/funkhouser.pdf
CVPR 2016


Scene Understanding

with 3D Deep Networks

Thomas Funkhouser

Princeton University

Disclaimer: I am talking about the work of these people …

Shuran Song, Andy Zeng, Fisher Yu,
Angela Dai, Matthias Niessner, Matt Fisher, Jianxiong Xiao,
Maciej Halber, Yinda Zhang

Goal

Understanding indoor scenes observed in RGB-D images

• Robotics

• Augmented reality

• Virtual tourism

• Surveillance

• Home remodeling

• Real estate

• Telepresence

• Forensics

• Games

• etc.

Goal

Understanding indoor scenes observed in RGB-D images

Input RGB-D Image(s)

Semantic Segmentation

Goal

Understanding indoor scenes observed in RGB-D images in 3D

3D Scene Understanding

Input RGB-D Image(s)

Semantic Segmentation

Goal

Understanding indoor scenes observed in RGB-D images in 3D

• Surface reconstruction

• Amodal object detection

• Object relationships

• Materials, lights, etc.

• Physical properties

• Novel views

• Info sharing

• Spatial inference

• Simulation

• etc.

Semantic Segmentation

Goal for This Talk

Learn ConvNets to recognize patterns in voxels

• Local shape descriptor

• Amodal object detection

• Semantic scene completion

Talk Outline

Local shape descriptor

Amodal object detection

Semantic scene completion

Scale: small → large

Talk Outline

Local shape descriptor

Amodal object detection

Semantic scene completion

Scale: small → large

A. Zeng, S. Song, M. Niessner, M. Fisher, J. Xiao, T. Funkhouser,

“3DMatch: Learning Local Geometric Descriptors from 3D Reconstructions,”

submitted to CVPR 2017

Local Shape Descriptor

Goal: train a discriminating 3D local shape descriptor from data

[Figure: two local patches from different scans are encoded as descriptor vectors (e.g., 0.58 0.21 0.92 0.67 0.04 0.53 …); similar descriptors indicate a match]
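As a toy illustration of the matching step above (Python, not the authors' code): two descriptors are compared by their distance in feature space. The first vector echoes the slide; the second vector and the 0.7 threshold are made up for illustration.

import numpy as np

desc_a = np.array([0.58, 0.21, 0.92, 0.67, 0.04, 0.53])  # descriptor of patch A (values from the slide)
desc_b = np.array([0.55, 0.25, 0.90, 0.70, 0.02, 0.50])  # descriptor of patch B (hypothetical values)

distance = np.linalg.norm(desc_a - desc_b)  # L2 distance in descriptor space
is_match = distance < 0.7                   # hypothetical matching threshold
print(f"distance={distance:.3f}, match={is_match}")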

Local Shape Descriptor

Challenge: where to get training data?

Local Shape Descriptor: “3D Match”

Approach: train on wide-baseline correspondences in RGB-D reconstructions

“Ground truth” match between

RGB-D Images from different views

Local Shape Descriptor: “3D Match”

Approach: train on wide-baseline correspondences in RGB-D reconstructions

Local Shape Descriptor: “3D Match”

Method: sample true/false correspondences from RGB-D reconstructions,
train a Siamese network
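A minimal PyTorch sketch of this training idea, assuming a small 3D ConvNet encoder over TSDF patches and a contrastive-style loss on true/false correspondence pairs; the actual 3DMatch architecture, patch size, descriptor dimension, and loss details may differ.

import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Tiny 3D ConvNet mapping a voxel patch to a descriptor vector (illustrative sizes)."""
    def __init__(self, desc_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool3d(2),
            nn.Flatten(), nn.LazyLinear(desc_dim),
        )
    def forward(self, x):
        return self.net(x)

def contrastive_loss(d1, d2, label, margin=1.0):
    """label = 1 for true correspondences, 0 for non-correspondences."""
    dist = torch.norm(d1 - d2, dim=1)
    pos = label * dist.pow(2)                                      # pull matches together
    neg = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)   # push non-matches apart
    return (pos + neg).mean()

# One illustrative step on random stand-in patches; in the real pipeline the pairs
# come from wide-baseline correspondences in RGB-D reconstructions.
encoder = PatchEncoder()
p1 = torch.randn(8, 1, 30, 30, 30)
p2 = torch.randn(8, 1, 30, 30, 30)
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(p1), encoder(p2), labels)
loss.backward()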

Local Shape Descriptor: “3D Match”

Result: learns to discriminate local shapes found in real-world data

Local Shape Descriptor: “3D Match” Results

Result 1: learned feature descriptor predicts RGB-D point correspondences

more accurately than hand-tuned descriptors

[Charts: match classification error at 95% recall; fragment alignment success rate]

Local Shape Descriptor: “3D Match” Results

Result 2: feature descriptor learned from RGB-D reconstructions provides reliable
matching for recognizing poses of small objects in the Amazon Picking Challenge

Predicting pose of 3D object model in RGB-D scan

Object pose prediction accuracy

Local Shape Descriptor: “3D Match” Results

Result 3: feature descriptor learned from RGB-D reconstructions provides

discriminative matching of semantic correspondences on 3D meshes

Talk Outline

Local Shape Descriptor

Amodal object detection

Semantic scene completion

Scale: small → large

S. Song and J. Xiao,

“Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images,”

CVPR 2016

Object Detection

Goal: given an RGB-D image, find objects (labeled 3D amodal bounding boxes)

Input: single RGB-D image → Output: labeled 3D amodal boxes

[CVPR13] Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images

[IJCV14] Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation

[ECCV14] Object Detection and Segmentation using Semantically Rich Image and Depth Features

[CVPR15] Aligning 3D Models to RGB-D Images of Cluttered Scenes

[CVPR16] Cross Modal Distillation for Supervision Transfer

[Figure: typical prior pipeline — 3D input (image + depth map, with the depth map encoded as extra channels) → 2D operations (2D region proposal, 2D object detection, 2D instance segmentation, 2D contour detection, coarse pose classification, point cloud alignment) → 3D output (3D amodal detection result)]

Object Detection

Most previous work: 2D operations (pipeline above); this talk: 3D deep learning

Object Detection: “Deep Sliding Shapes”

Approach: 3D input (image + depth map) → 3D operations → 3D output (3D amodal detection result, e.g., bed)

Object Detection: “Deep Sliding Shapes”

Pipeline: RGB-D image → Region Proposal Network → Object Recognition Network → labeled 3D box (e.g., bed)

Object Detection: “Deep Sliding Shapes”

Pipeline: RGB-D image → Region Proposal Network → Object Recognition Network

Object Detection: “Deep Sliding Shapes”

Data encoding:

1) Estimate major directions of room
2) Compute TSDF

Object Detection: “Deep Sliding Shapes”

Data encoding:

1) Estimate major directions of room
2) Compute TSDF

Scene extent: 5.2 m × 5.2 m × 2.5 m
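A rough sketch of the TSDF step (step 2 above), assuming a projective TSDF computed from a single depth map over a voxel grid; the truncation distance, grid handling, and camera model are illustrative, and the camera-to-room alignment from step 1 is omitted.

import numpy as np

def projective_tsdf(depth, K, grid_origin, voxel_size, grid_dims, trunc=0.25):
    """depth: HxW metric depth map; K: 3x3 intrinsics; grid_origin: (3,) array; returns a grid_dims TSDF."""
    idx = np.indices(grid_dims).transpose(1, 2, 3, 0)       # voxel indices, shape (X, Y, Z, 3)
    pts = grid_origin + (idx + 0.5) * voxel_size             # voxel centers in camera coordinates
    z = np.maximum(pts[..., 2], 1e-6)                        # guard against division by zero
    u = np.round(K[0, 0] * pts[..., 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[..., 1] / z + K[1, 2]).astype(int)
    valid = (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0]) & (pts[..., 2] > 0)
    tsdf = np.full(grid_dims, trunc, dtype=np.float32)
    sdf = depth[v[valid], u[valid]] - pts[..., 2][valid]     # positive in front of the surface, negative behind
    tsdf[valid] = np.clip(sdf, -trunc, trunc)
    return tsdf / trunc                                      # normalized to [-1, 1]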

Object Detection: “Deep Sliding Shapes”

Data encoding:

1) Estimate major directions of room
2) Compute TSDF

TSDF → Region Proposal Network → 3D region proposals

Object Detection: “Deep Sliding Shapes”

3D region proposal network:

[Figure: pixel area of an object varies by roughly ×50 across views, while its physical 3D size varies by only about ×3 — motivating fixed physical-size 3D anchors]

Object Detection: “Deep Sliding Shapes”

3D region proposal network:

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

[Network diagram: Input TSDF → Conv1 + ReLU + Pool → Conv2 + ReLU + Pool]

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

[Network diagram: Input TSDF → Conv1 + ReLU + Pool → Conv2 + ReLU + Pool → Conv3 + ReLU + Pool → {Conv → Class (Softmax); Conv → 3D Box (Smooth L1)}]
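A hedged PyTorch sketch of the proposal heads implied by this diagram: at every location of the 3D feature volume, one 1×1×1 convolution scores objectness per anchor and another regresses 3D box offsets. The channel count, feature-volume size, and number of anchor types are illustrative, not the paper's exact values.

import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, in_channels=64, num_anchors=19):   # anchor count is illustrative
        super().__init__()
        self.cls = nn.Conv3d(in_channels, num_anchors * 2, kernel_size=1)  # object / not-object per anchor
        self.box = nn.Conv3d(in_channels, num_anchors * 6, kernel_size=1)  # (cx, cy, cz, sx, sy, sz) per anchor

    def forward(self, feat):
        return self.cls(feat), self.box(feat)

feat = torch.randn(1, 64, 26, 26, 26)        # feature volume from the 3D ConvNet (made-up size)
scores, boxes = ProposalHead()(feat)          # shapes: (1, 38, 26, 26, 26) and (1, 114, 26, 26, 26)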

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

Receptive field: 0.4 m³

Level 1 anchors (examples): 0.6×0.2×0.4 m, 0.5×0.5×0.2 m, 0.6×0.2×0.4 m

[Same network diagram as above, with the level-1 class/box heads attached after Conv3]
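A small sketch of how level-1 anchors could be laid out over the proposal grid, using the sizes listed on the slide; the grid stride and origin are assumptions for illustration.

import numpy as np

anchor_sizes = [(0.6, 0.2, 0.4), (0.5, 0.5, 0.2), (0.6, 0.2, 0.4)]   # meters, from the slide
stride = 0.1                                                          # hypothetical spacing between anchor centers (m)

def make_anchors(grid_dims, origin):
    """Return a (num_locations * num_sizes, 6) array of [cx, cy, cz, sx, sy, sz]."""
    anchors = []
    for i in range(grid_dims[0]):
        for j in range(grid_dims[1]):
            for k in range(grid_dims[2]):
                center = origin + stride * (np.array([i, j, k]) + 0.5)
                for size in anchor_sizes:
                    anchors.append(np.concatenate([center, size]))
    return np.array(anchors)

anchors = make_anchors((26, 26, 26), origin=np.zeros(3))   # one set of anchors per feature-map location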

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

Receptive field: 0.4 m³ (level 1)

[Network diagram extended: … Conv3 + ReLU + Pool → Conv4 + ReLU + Pool → a second pair of heads {Conv → Class (Softmax); Conv → 3D Box (Smooth L1)}, in addition to the level-1 heads]

Object Detection: “Deep Sliding Shapes”

Multiscale 3D region proposal network:

Receptive field: 1 m³ (level 2) vs. 0.4 m³ (level 1)

Level 2 anchors: predicted from the deeper Conv4 features through their own class (Softmax) and 3D box (Smooth L1) heads

Object Detection: “Deep Sliding Shapes”

Receptive field: 1 m³

bed

Object Detection: “Deep Sliding Shapes”

Pipeline: RGB-D image → Region Proposal Network → Object Recognition Network (each 3D proposal is also projected to 2D to extract an image patch)

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:

Inputs: proposal TSDF + projected image patch

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:

[Network diagram: a 3D ConvNet branch on the proposal TSDF (Conv1 + ReLU + Pool → Conv2 + ReLU + Pool → Conv3 + ReLU → FC2) and a 2D VGG branch pretrained on ImageNet for the image patch → concatenation → FC3 → {FC → Class (Softmax); FC → 3D Box (Smooth L1)}]
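A hedged PyTorch sketch of this two-branch design: a 3D ConvNet encodes the proposal's TSDF, a 2D branch (a small stand-in for the ImageNet-pretrained VGG) encodes the projected image patch, and the concatenated features feed the class and 3D-box heads. Layer sizes, class count, and the simplified 2D branch are illustrative.

import torch
import torch.nn as nn

class JointRecognition(nn.Module):
    def __init__(self, num_classes=19):                    # class count is illustrative
        super().__init__()
        self.net3d = nn.Sequential(                        # 3D branch on the proposal TSDF
            nn.Conv3d(1, 32, 3), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, 3), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(64, 128, 3), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(),
        )
        self.net2d = nn.Sequential(                        # stand-in for the 2D VGG branch
            nn.Conv2d(3, 64, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.LazyLinear(1024), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, num_classes)       # softmax classification
        self.box_head = nn.Linear(1024, 6)                 # smooth-L1 3D box regression

    def forward(self, tsdf, patch):
        feat = torch.cat([self.net3d(tsdf), self.net2d(patch)], dim=1)   # concatenation
        feat = self.fc(feat)
        return self.cls_head(feat), self.box_head(feat)

cls_scores, box_offsets = JointRecognition()(torch.randn(2, 1, 30, 30, 30), torch.randn(2, 3, 224, 224))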

Object Detection: “Deep Sliding Shapes”

Joint object recognition network:

Object Detection: “Deep Sliding Shapes” Experiments

Train and test on amodal boxes provided in SUN RGB-D

S. Song, S. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite,” CVPR 2015

Compared methods are grouped as 2D deep learning, 3D deep learning, and 3D non-deep learning.

Object Detection: “Deep Sliding Shapes” Results

Quantitative comparisons:

Object detection accuracy on NYU v2 dataset (mAP)

Sliding Shapes: sofa Ours: bathtub

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparisons:

Sliding Shapes: chair Ours: sofa

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparisons:

Sliding Shapes: table Ours: bed

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparisons:

Sliding Shapes: miss Ours: table and chairs

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparisons:

Sliding Shapes: toilet Ours: garbage bin+bed

Object Detection: “Deep Sliding Shapes” Results

Qualitative comparisons:

Talk Outline

Local Shape Descriptor

Amodal object detection

Semantic scene completion

Scale: small → large

S. Song, F. Yu, A. Zeng, A. Chang, M. Savva, and T. Funkhouser,

“Semantic Scene Completion from a Single Depth Image,”

submitted to CVPR 2017

Input: Single view depth map Output: Semantic scene completion

Semantic Scene Completion

Goal: given an RGB-D image, label all voxels by semantic class

3D Scene

visible surface

free space

occluded space

outside view

outside room


Semantic Scene Completion

Goal: given an RGB-D image, label all voxels by semantic class

Task comparison: surface segmentation (Silberman et al.), scene completion (Firman et al.), and semantic scene completion (this paper).

The occupancy and the object identity are tightly intertwined!

Semantic Scene Completion

Prior work: segmentation OR completion

Prediction: N+1 classes (N semantic object classes + empty space)

SSCNet

Input: Single view depth map Output: Semantic scene completion

3D ConvNet
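A minimal sketch of what an N+1-way voxel prediction looks like as a training target, assuming a plain voxel-wise softmax (cross-entropy) loss; the class count, output grid size, and loss weighting here are illustrative and may differ from the paper's.

import torch
import torch.nn.functional as F

N = 11                                               # number of semantic object classes (illustrative)
logits = torch.randn(1, N + 1, 60, 36, 60)           # per-voxel scores from the network (N classes + empty)
target = torch.randint(0, N + 1, (1, 60, 36, 60))    # ground-truth label per voxel
loss = F.cross_entropy(logits, target)               # voxel-wise softmax loss over N+1 classes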

Semantic Scene Completion: “SSCNet”

Approach: end-to-end deep network

Semantic Scene Completion: “SSCNet”

Semantic Scene Completion: “SSCNet”

Encode 3D space using flipped TSDF

Semantic Scene Completion: “SSCNet”

Encode 3D space using flipped TSDF

Voxel size: 0.02 m
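One plausible way to "flip" a truncated signed distance so its magnitude peaks at the surface rather than at the truncation boundary, which concentrates the signal (and gradients) near geometry; this is a sketch of the idea, not necessarily the paper's exact formula.

import numpy as np

def flip_tsdf(tsdf, trunc=1.0):
    """tsdf: signed distances already truncated to [-trunc, trunc]."""
    sign = np.where(tsdf >= 0.0, 1.0, -1.0)      # treat the surface itself as "in front"
    return sign * (trunc - np.abs(tsdf))         # magnitude largest at the surface, zero at truncation

tsdf = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])     # far behind ... surface ... far in front
print(flip_tsdf(tsdf))                           # [-0.  -0.5  1.   0.5  0. ]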

Semantic Scene Completion: “SSCNet”

Local geometry

Receptive field: 0.98 m

Semantic Scene Completion: “SSCNet”

High-level 3D context via a big receptive field, provided by dilated convolution

Receptive field: 2.26 m
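A brief PyTorch sketch of the mechanism: a dilated 3D convolution enlarges the receptive field without extra parameters or loss of resolution; the channel counts and feature-volume size are illustrative.

import torch
import torch.nn as nn

dilated = nn.Conv3d(64, 64, kernel_size=3, dilation=2, padding=2)   # effective 5x5x5 neighborhood
x = torch.randn(1, 64, 30, 18, 30)
y = dilated(x)
print(y.shape)   # torch.Size([1, 64, 30, 18, 30]) -- same spatial size, larger receptive field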

Semantic Scene Completion: “SSCNet”

Multi-scale aggregation of features with receptive fields of 0.98 m, 1.62 m, and 2.26 m
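A sketch of the aggregation step, assuming feature volumes taken from three depths of the network (with the receptive fields listed above) share one spatial resolution and are concatenated along the channel dimension.

import torch

f1 = torch.randn(1, 64, 30, 18, 30)   # shallower layer: small receptive field
f2 = torch.randn(1, 64, 30, 18, 30)   # middle layer
f3 = torch.randn(1, 64, 30, 18, 30)   # deep, dilated layer: large receptive field
aggregated = torch.cat([f1, f2, f3], dim=1)   # (1, 192, 30, 18, 30) multi-scale feature volume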

Semantic Scene Completion: “SSCNet”

Semantic Scene Completion: “SSCNet” Experiments

Where to get training data?

Semantic Scene Completion: “SSCNet” Experiments

Where to get training data?

No dense volumetric ground truth with semantic labels for a complete scene

SUN3D: no semantic labels; NYU: only visible surfaces

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

• 46K houses

• 50K floors

• 400K rooms

• 5.6M object instances

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

[Figure: synthetic camera views → rendered depth → ground-truth semantic scene completion]

Semantic Scene Completion: “SSCNet” Experiments

SUNCG dataset

Train on SUNCG, test on NYU

Semantic Scene Completion: “SSCNet” Experiments

Semantic Scene Completion: “SSCNet” Results

Result: better than previous volumetric completion algorithms

Comparison to previous algorithms for volumetric completion

[Figure panels: Color Image, Observed Surface, Firman et al., Zhang et al., Ours (SSCNet), Ground Truth]

Semantic Scene Completion: “SSCNet” Results

Result: better than previous 3D model fitting algorithms

Comparison to previous algorithms for 3D model fitting

[Figure panels, three examples: Color Image, Observed Surface, Lin et al., Geiger and Wang, Ours (SSCNet), Ground Truth]

Summary

Three projects where ConvNets are trained to recognize patterns in voxels

with different …

• Tasks

• Scales

• Training data

• Loss functions

• Network architectures

• Training protocols

Future Challenges

Acquiring larger data sets

Leveraging geometric structure

Leveraging semantic structure

Better integration of RGB and D

Better surface parameterizations

Finer-grained categories

Higher resolution

etc.


Future Challenges: Acquiring larger data sets

1,500 surface reconstructions, 36,213 labeled objects

A. Dai, A. Chang, M. Savva,

M. Halber, T. Funkhouser, and M. Niessner,

“ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes,”

submitted to CVPR 2017.

Future Challenges: Leveraging geometric structure

M. Halber, T. Funkhouser,

“Fine-to-Coarse Registration of RGB-D Scans,”

submitted to CVPR 2017


Future Challenges: Leveraging semantic structure

[Figure: "Sleeping Area" scene template with labeled objects — bed, nightstand, lamp, dresser, dresser with mirror, ottoman, sofa, wall]

Y. Zhang, M. Bai, J. Xiao, P. Kohli, and S. Izadi,

“DeepContext: Context-Encoding Neural Pathways

for 3D Holistic Scene Understanding,”

submitted to CVPR 2017

Acknowledgments

Princeton: Angel Chang, Maciej Halber, Manolis Savva, Elena Sizikova, Shuran Song, Jianxiong Xiao, Fisher Yu, Yinda Zhang, Andy Zeng

Collaborators: Angela Dai, Matt Fisher, Matthias Niessner, Ersin Yumer

Data: SUN3D, 7-Scenes, Analysis-by-Synthesis, NYU, Trimble, Planner5D

Funding: Intel, NSF, Adobe

Thank You!
