Neural Scene De-rendering
Jiajun Wu¹, Joshua B. Tenenbaum¹, Pushmeet Kohli²,*
¹ Massachusetts Institute of Technology  ² DeepMind
* Work done when the author was with Microsoft Research

Poster PDF: nsd.csail.mit.edu/talks/nsd_poster_cvpr.pdf



Scene De-rendering

Goal: a compact, interpretable scene representation

Motivation
• An object-based, disentangled representation has wide applications
• Representations learned by current deep nets are hard to interpret

Solution: bring a forward graphics engine into the recognition loop

Advantages
• Graphics engines bring in symbolic representation naturally
• Graphics engines generalize well to a variable number of objects
• The learned representation is rich and has wide applications

Model

[Figure: two example scenes, each linked to its scene description by De-render and Render arrows]

<objects>
<balloon: right>
<bench: yellow>
<tree: right>
<boy: stand happy>
<girl: sit sad>
</objects>

<objects>
<pig: left close>
<villager: left>
<tree: tall right>
</objects>

[Figure: pipeline panels (a) Input image, (b) Segment proposals, (c) Inference, (d) Rendered image; steps (I) to (III): Proposing segments, Interpreting proposals, Rendering images; losses: (prediction loss), (reconstruction loss)]

Applications
• Image editing
• Captioning: The boy is …
• Inpainting, analogy-making, …

Scene XML
<object>
  <category>triangle</category>
  <size>1.5</size>
  <color>blue</color>
  <position>1.5,2,1</position>
  <yaw>0</yaw>
  …
</object>
<object>
  …
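To make the Scene XML format concrete, here is a minimal Python sketch that reads one such entry into a typed record. The `<scene>` root wrapper, the `SceneObject` class, and the `parse_scene` helper are assumptions introduced for illustration; the poster shows only the `<object>` fields, and the elided fields are left out rather than guessed.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a <scene> root wraps the <object> entries so the text is
# well-formed XML. Only the fields visible on the poster are included.
SCENE_XML = """
<scene>
  <object>
    <category>triangle</category>
    <size>1.5</size>
    <color>blue</color>
    <position>1.5,2,1</position>
    <yaw>0</yaw>
  </object>
</scene>
"""


@dataclass
class SceneObject:
    """Hypothetical typed view of one <object> entry."""
    category: str
    size: float
    color: str
    position: Tuple[float, ...]
    yaw: float


def parse_scene(xml_text: str) -> List[SceneObject]:
    """Parse every <object> element under the root into a SceneObject."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for node in root.iter("object"):
        objects.append(
            SceneObject(
                category=node.findtext("category"),
                size=float(node.findtext("size")),
                color=node.findtext("color"),
                # "1.5,2,1" becomes (1.5, 2.0, 1.0)
                position=tuple(
                    float(v) for v in node.findtext("position").split(",")
                ),
                yaw=float(node.findtext("yaw")),
            )
        )
    return objects


scene = parse_scene(SCENE_XML)
print(scene[0])
```

Because the description is a list of objects rather than a fixed-size vector, the same parser handles scenes with any number of objects, which is the generalization property the poster attributes to graphics engines.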

[Figure panels: (a) Corrupted input, (b) Reconstruction, (c) Original image]

Analogy-Making
[Figure: Reference 𝐴 : Reference 𝐴′ :: Query 𝐵 : Prediction 𝐵′]

Graphics Engine

Inference & Reconstruction
[Figure labels: "If = 0", "If = 1"; panels: (a) Input image, (b) CNN+LSTM, (c) NSD, (d) NSD (full)]
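One way to read the analogy-making setup (𝐴 : 𝐴′ :: 𝐵 : 𝐵′) is as an edit in representation space: find the attributes that changed between the two references and apply the same change to the query. The sketch below assumes single-object scenes stored as plain dictionaries; `make_analogy` and the attribute values are hypothetical, and the poster does not state that the authors compute analogies this way.

```python
from numbers import Real


def make_analogy(ref_a: dict, ref_a_prime: dict, query_b: dict) -> dict:
    """Apply the attribute changes that turn A into A' to the query B.

    Numeric attributes are shifted by the same offset; symbolic
    attributes (strings) are overwritten with the new value.
    """
    prediction = dict(query_b)
    for key, new_value in ref_a_prime.items():
        old_value = ref_a.get(key)
        if new_value == old_value:
            continue  # attribute unchanged between the references
        current = query_b.get(key)
        if all(isinstance(v, Real) for v in (old_value, new_value, current)):
            prediction[key] = current + (new_value - old_value)
        else:
            prediction[key] = new_value
    return prediction


# Hypothetical attribute values, loosely modeled on the Scene XML fields.
a = {"category": "triangle", "color": "blue", "size": 1.5}
a_prime = {"category": "triangle", "color": "red", "size": 2.0}
b = {"category": "square", "color": "blue", "size": 1.0}

b_prime = make_analogy(a, a_prime, b)
print(b_prime)
```

The predicted description would then be passed to the graphics engine to render the image 𝐵′; that rendering step is outside this sketch.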

Results

Training Steps
• Supervised pre-training with the prediction loss
• End-to-end fine-tuning with the reconstruction loss (with REINFORCE)

• Graphics engines as generalized decoders
• Visually distinctive images may have similar representations
• Solution: Optimizing in both spaces
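The fine-tuning step above relies on REINFORCE because a graphics engine is generally not differentiable, so the reconstruction loss cannot be backpropagated through it directly. The toy sketch below illustrates the idea only: a categorical policy picks one of three attribute values, a lookup table stands in for the renderer, and the negative squared reconstruction error serves as the reward. The `render` palette, the moving-average baseline, and all hyperparameters are assumptions for illustration, not the authors' setup.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def render(choice):
    """Toy stand-in for a graphics engine: a non-differentiable lookup."""
    palette = [0.0, 0.5, 1.0]  # hypothetical pixel intensity per choice
    return palette[choice]


def reinforce_finetune(target_pixel, logits, steps=2000, lr=0.2, seed=0):
    """Tune the logits so that rendered samples reconstruct the target."""
    rng = random.Random(seed)
    logits = list(logits)
    baseline = 0.0  # moving-average baseline to reduce variance (assumption)
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        reward = -(render(k) - target_pixel) ** 2  # negative reconstruction loss
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # Gradient of log p(k) with respect to the logits is onehot(k) - probs.
        for j in range(len(logits)):
            grad_log_p = (1.0 if j == k else 0.0) - probs[j]
            logits[j] += lr * advantage * grad_log_p
    return logits


final_logits = reinforce_finetune(target_pixel=1.0, logits=[0.0, 0.0, 0.0])
final_probs = softmax(final_logits)
best_choice = max(range(len(final_probs)), key=final_probs.__getitem__)
print(best_choice, [round(p, 3) for p in final_probs])
```

In the poster's pipeline this step follows supervised pre-training with the prediction loss, so the policy would start from informed logits rather than the uniform initialization used here.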

Caption Retrieval
[Figure: example caption: "jenny and mike are having fun in the sandbox unaware of the storm that's coming their way"]