3D ShapeNets: A Deep Representation for Volumetric Shape Modeling
by Wu, Song, Khosla, Yu, Zhang, Tang, Xiao

presented by Abhishek Sinha


3D Shape Prior

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015



Outline

• Problem
• Motivation
• Desirable Properties for Representation
• Architecture
• Dataset
• Applications
• Extensions
• Discussion Points


Problem

Learn ‘useful’ 3D shape representations from images


Motivation


3D Shape Representation

• Shape Synthesis
• Shape Completion
• 2.5D Object Recognition (e.g. person, tricycle)
• Feature Extractor

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015


Desirable Properties

What is a Desirable 3D Shape Representation?

• Data-driven
• Generic: from simple shapes to complex shapes
• Compositional: from building blocks to the full object
• Versatile: mesh classification, shape completion, shape generation, 2.5D object recognition

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015


Architecture

3D Deep Learning

3D shape as a volumetric representation: each mesh is converted into a binary voxel grid.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
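As a rough illustration of the mesh-to-voxel step, here is a minimal sketch (not the authors' pipeline): it assumes the mesh has already been sampled into a surface point cloud, and the 30-voxel grid size and 2-voxel padding are assumptions.

import numpy as np

def voxelize_points(points, grid_size=30, pad=2):
    # Convert an (N, 3) array of surface points into a binary occupancy grid.
    # The shape is scaled to fit inside the grid, leaving `pad` empty voxels.
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    scale = (grid_size - 2 * pad - 1) / extent
    idx = np.floor((points - mins) * scale).astype(int) + pad
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # mark occupied voxels
    return grid

# Toy usage: points sampled on a unit sphere become a hollow voxel shell.
pts = np.random.randn(5000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(voxelize_points(pts).sum(), "occupied voxels in a 30x30x30 grid")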


3D ShapeNets

Convolutional Deep Belief Network

A Deep Belief Network is a generative graphical model; here it describes the joint distribution of the voxel input x and the object label y.

• Convolution to enable compositionality
• No pooling, to reduce reconstruction error

• Layers 1-3: convolutional RBMs
• Layer 4: fully connected RBM
• Layer 5: multinomial label units + Bernoulli feature units form an associative memory

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
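For background (not reproduced from the slide): the layers are built from restricted Boltzmann machines, and in the convolutional layers each filter's weights are shared across spatial locations. As a hedged reference, the standard binary RBM defines an energy and factorial conditionals roughly as follows; the paper's convolutional variant modifies this with weight sharing and per-filter biases.

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i\, w_{ij}\, h_j$$

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(c_j + \sum_i w_{ij} v_i\Big), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(b_i + \sum_j w_{ij} h_j\Big)$$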


3D ShapeNets ≠ CNNs

Unlike a CNN, the model defines both a generative process and a discriminative process.

* 3D ShapeNets can be converted into a CNN and discriminatively trained with back-propagation.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015

Training

Maximum likelihood learning of the Convolutional Deep Belief Network.

Layer-wise pre-training:
• The lower four layers are trained by contrastive divergence (CD)
• The last layer is trained by fast persistent contrastive divergence (FPCD) [1]

Fine-tuning:
• Wake-sleep [2], but with the weights kept tied

[1] Tieleman and Hinton. "Using fast weights to improve persistent contrastive divergence." ICML 2009.
[2] Hinton et al. "A fast learning algorithm for deep belief nets." Neural Computation, 2006.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
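To make the layer-wise pre-training concrete, here is a minimal sketch of my own (not the authors' code): a stack of small fully connected binary RBMs is trained greedily with CD-1, each layer on Bernoulli samples of the hidden activations of the layer below. Layer sizes, learning rate and epoch count are arbitrary; the convolutional weight sharing, FPCD for the top layer, and wake-sleep fine-tuning are omitted.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=5, lr=0.05):
    # Train one binary RBM with CD-1; bias terms omitted for brevity.
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(v0 @ W)                                # positive phase
            h0 = (rng.random(n_hidden) < p_h0).astype(float)      # sample hidden units
            v1 = sigmoid(W @ h0)                                  # "reconstruction" (probabilities)
            p_h1 = sigmoid(v1 @ W)                                # negative phase
            W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    return W

def pretrain_stack(data, layer_sizes):
    # Greedy layer-wise pre-training: each RBM is trained on samples of
    # the previous layer's hidden activations.
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(layer_input, n_hidden)
        probs = sigmoid(layer_input @ W)
        layer_input = (rng.random(probs.shape) < probs).astype(float)
        weights.append(W)
    return weights

# Toy usage: 200 random binary "voxel" vectors, three stacked layers.
toy_data = (rng.random((200, 100)) < 0.2).astype(float)
stack = pretrain_stack(toy_data, layer_sizes=[64, 32, 16])
print([W.shape for W in stack])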

Sampling

Generation process: Gibbs sampling, with the object label clamped.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
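A minimal sketch of block Gibbs sampling in a toy binary RBM (my own illustration, with random weights standing in for the trained model): the chain alternates between the hidden and visible conditionals, and clamping is simulated by resetting a chosen subset of units, here a one-hot "label", to fixed values after every update. In the actual model the label units live in the top layer; appending them to the visible layer is a simplification.

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, n_labels = 50, 30, 5
# Label units appended to the visible layer (simplification, see above).
W = 0.1 * rng.standard_normal((n_visible + n_labels, n_hidden))

def gibbs_sample(W, clamp_idx, clamp_val, steps=100):
    # Alternate h ~ p(h | v) and v ~ p(v | h); re-impose the clamped units each step.
    n_vis = W.shape[0]
    v = (rng.random(n_vis) < 0.5).astype(float)
    v[clamp_idx] = clamp_val
    for _ in range(steps):
        h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
        v = (rng.random(n_vis) < sigmoid(W @ h)).astype(float)
        v[clamp_idx] = clamp_val          # keep the object label fixed
    return v[:n_visible]                  # the generated "shape" part

label = np.zeros(n_labels); label[2] = 1.0                       # clamp class 2
shape = gibbs_sample(W, np.arange(n_visible, n_visible + n_labels), label)
print(shape.shape, shape.sum())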

Dataset

Big 3D Data

Query keywords: common object categories from the SUN database that contain no fewer than 20 object instances per category.

151,128 models, 660 categories.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015

Applications

2.5D Completion & Recognition

Gibbs sampling with clamping.

Training on CAD models, with no discriminative tuning!

[29] R. Socher, B. Huval, B. Bhat, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In NIPS 2012.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
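The 2.5D input is a single depth map, so only part of the voxel grid is observed. Here is a hedged sketch of that conversion (my own simplification): a pinhole back-projection with made-up intrinsics that marks only the observed surface voxels; the paper's pipeline additionally distinguishes free space from unknown space, which is omitted here.

import numpy as np

def depth_to_surface_voxels(depth, fx=300.0, fy=300.0, cx=None, cy=None,
                            grid_size=30, pad=2):
    # Back-project a depth map (H, W) to 3D points, then mark the
    # corresponding surface voxels in a binary grid.
    h, w = depth.shape
    cx = w / 2.0 if cx is None else cx
    cy = h / 2.0 if cy is None else cy
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)

    mins = pts.min(axis=0)
    extent = (pts.max(axis=0) - mins).max()
    scale = (grid_size - 2 * pad - 1) / extent
    idx = np.floor((pts - mins) * scale).astype(int) + pad
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # observed surface voxels only
    return grid

# Toy usage: a synthetic "wall" tilted in depth.
depth = np.tile(np.linspace(1.0, 2.0, 64), (48, 1))
print(depth_to_surface_voxels(depth).sum(), "surface voxels observed")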

View Planning for Recognition

From a single view, the volumetric representation is ambiguous: sofa? bathtub? dresser? What is it? Not sure; look from another view. Where to look next? The Next-Best-View is selected, a new depth map is captured, and 3D ShapeNets recognizes the object: "Aha! It is a sofa!"

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015

Deep View Planning

(Figure: candidate next views with associated class probabilities, e.g. 0.8, 0.7, 0.5, 0.4, 0.3, 0.2.)

Mathematically, this is equivalent to evaluating the conditional mutual information; a reconstruction of the criterion is sketched below.

Slide Credit: Wu et al
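The equation on the slide did not survive extraction. As a hedged reconstruction from the surrounding description (the notation is mine): the next best view is the candidate view that maximizes the mutual information between the object class C and the new, yet-unobserved voxels, conditioned on the voxels observed so far:

$$ V^{*} = \arg\max_{i}\; I\big(\mathcal{C};\, \mathbf{x}_{n_i} \,\big|\, \mathbf{x}_o = x_o\big) $$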

Deep View Planning: Recognition Accuracy from Two Views

Based on each algorithm's choice of view, we obtain the actual depth map for the next view and recognize the objects from the two views with 3D ShapeNets. Our algorithm stands out as the uncertainty of the first view increases.

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015

Back Propagation Fine-tuning

(Architecture figure, 3D ShapeNets converted into a 3D CNN: a 30×30×30 voxel input passes through 48 filters of stride 2, 160 filters of stride 2, and 512 filters of stride 1 (spatial resolution 30 → 13 → 5 → 2), then a 1200-unit and a 4000-unit fully connected layer, and finally the object label.)

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
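A minimal sketch of the converted 3D CNN in PyTorch (my own reconstruction, not the authors' code): the filter counts and strides come from the slide; the kernel sizes 6, 5 and 4 are assumptions chosen so that the spatial resolution shrinks 30 → 13 → 5 → 2 as in the figure, and num_classes=40 matches the 40-category benchmark mentioned in the discussion points.

import torch
import torch.nn as nn

class ShapeNets3DCNN(nn.Module):
    def __init__(self, num_classes: int = 40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 48, kernel_size=6, stride=2), nn.ReLU(),    # 30 -> 13
            nn.Conv3d(48, 160, kernel_size=5, stride=2), nn.ReLU(),  # 13 -> 5
            nn.Conv3d(160, 512, kernel_size=4, stride=1), nn.ReLU(), # 5 -> 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 2 * 2 * 2, 1200), nn.ReLU(),
            nn.Linear(1200, 4000), nn.ReLU(),
            nn.Linear(4000, num_classes),   # object label
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (batch, 1, 30, 30, 30) binary occupancy grid
        return self.classifier(self.features(voxels))

model = ShapeNets3DCNN()
logits = model(torch.zeros(2, 1, 30, 30, 30))
print(logits.shape)  # torch.Size([2, 40])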

As a 3D Feature Extractor

Mesh classification & retrieval; 2.5D object recognition [29].

Slide Credit: Wu, Song et al. 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling, CVPR 2015
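To illustrate the feature-extractor use, here is a sketch under assumptions of mine: the features below are random stand-ins for activations of an upper layer of the network, and cosine similarity is one reasonable retrieval metric, not necessarily the one used in the paper.

import numpy as np

rng = np.random.default_rng(2)

def retrieve(query_feat, gallery_feats, top_k=5):
    # Rank gallery shapes by cosine similarity to the query feature.
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Stand-in features: one vector per mesh, e.g. upper-layer activations.
gallery = rng.standard_normal((1000, 1200))
query = gallery[42] + 0.1 * rng.standard_normal(1200)   # a noisy copy of item 42
idx, sims = retrieve(query, gallery)
print(idx, np.round(sims, 3))   # item 42 should rank first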

Extensions


• Include RGB information in representation

• 3D Segmentation

• Improve for non-rigid 3D objects


Discussion Points


• Is the network deep enough?

• 30×30×30 = 27,000 input voxels vs. 256×256 = 65,536 pixels for ImageNet

• ~150K training examples vs. millions for ImageNet


• Won’t removal of max-pooling layers hurt performance on classification tasks?

http://3dshapenets.cs.princeton.edu/

• Any other systems that use binary units with approximate training and inference techniques rather than standard back-prop?

• Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554

• Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th international conference on Machine learning. ACM, 2007.


• Are there better ways of representing 3D shapes? In particular, doesn't the voxel representation have the bottleneck of cubic dependency on grid size?

• Yes. Su, Maji et al. recognize 3D shapes from multiple 2D views instead of a voxel representation and obtain better classification results.

H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller. Multi-view Convolutional Neural Networks for 3D Shape Recognition. ICCV 2015.

• Are there other 3D CAD model datasets?

• 3D Warehouse. https://3dwarehouse.sketchup.com/

• Manually removing clutter from 3D CAD models is a problem.

• Did not address non-rigid objects sufficiently.

• Even the 40-category classification dataset seemed to contain only four non-rigid categories: persons, plants, sofas, curtains.


Appendix


Contrastive divergence learning: A quick way to learn an RBM

(Figure: an RBM with visible units $v_i$ and hidden units $h_j$, shown at t = 0 on the data and at t = 1 on the reconstruction.)

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{1} \right)$$

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update all the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function.

Slide Credit: Geoffrey Hinton


The wake-sleep algorithm for an SBN

• Wake phase: Use the recognition weights to perform a bottom-up pass.

– Train the generative weights to reconstruct activities in each layer from the layer above.

• Sleep phase: Use the generative weights to generate samples from the model.

– Train the recognition weights to reconstruct activities in each layer from the layer below.

(Figure: a sigmoid belief net with layers data, h1, h2, h3; generative weights W1, W2, W3 and recognition weights R1, R2, R3 connect adjacent layers.)

Slide Credit: Geoffrey Hinton
