LayoutVAE: Stochastic Scene Layout Generation From a Label Set

Akash Abdu Jyothi 1,3, Thibaut Durand 1,3, Jiawei He 1,3, Leonid Sigal 2,3, Greg Mori 1,3
1 Simon Fraser University  2 University of British Columbia  3 Borealis AI
{aabdujyo, tdurand, jha203}@sfu.ca  [email protected]  [email protected]

Abstract

Recently there has been increasing interest in scene generation within the research community. However, models used for generating scene layouts from textual description largely ignore plausible visual variations within the structure dictated by the text. We propose LayoutVAE, a variational autoencoder based framework for generating stochastic scene layouts. LayoutVAE is a versatile modeling framework that allows for generating full image layouts given a label set, or per-label layouts for an existing image given a new label. In addition, it is also capable of detecting unusual layouts, potentially providing a way to evaluate the layout generation problem. Extensive experiments on MNIST-Layouts and the challenging COCO 2017 Panoptic dataset verify the effectiveness of our proposed framework.

1. Introduction

Scene generation, which usually consists of realistic generation of multiple objects under a semantic layout, remains one of the core frontiers of computer vision. Despite the rapid progress and recent successes in object generation (e.g., celebrity faces, animals, etc.) [1, 9, 13] and scene generation [4, 11, 12, 19, 22, 30, 31], little attention has been paid to frameworks designed for stochastic semantic layout generation. Having a robust model for layout generation will not only allow us to generate reliable scene layouts, but also provide priors and means to infer latent relationships between objects, advancing progress in the scene understanding domain.

A plausible semantic layout calls for reasonable spatial and count relationships (relationships between the number of instances of different labels) between objects in a scene [5, 27]. For example, a person would either ride (on top of) a horse, or stand next to a horse, but seldom would he be under a horse. Another example would be that the number of ties would very likely be smaller than or equal to the number of people in an image. The first example shows an instance of a plausible spatial relationship and the second shows a case of a generic count relationship.

Figure 1: Several images from COCO that have the same label set: person, surfboard and sea. Given this simple label set, we observe that a large and diverse set of layouts is plausible.

Such intrinsic relationships buried in high-dimensional visual data are usually learned implicitly by mapping the textual description to visual data. However, since the textual description can always be treated as an abstraction of the visual data, the process becomes a one-to-many mapping. In other words, given the same text information as a condition, a good model should be able to generate multiple plausible images, all of which satisfy the semantic description.

Previous works focused on a popular simplified instance of the problem described above: scene generation based on a sentence description [6, 11, 12, 19, 24, 25, 30]. A typical sentence description includes partial information on both the background and objects, along with details of the objects' appearances and scene layout. These frameworks rely heavily on the extra relational information provided by the sentence. As a result, although these methods manage to generate realistic scenes, they tend to ignore learning the intrinsic relationships between the objects, prohibiting the wide adoption of such models where weaker descriptions are provided.

In this work, we consider a more sophisticated problem: scene generation based on a label set description. A label set, as a much weaker description, only provides the set of labels present in the image (without any additional relationship description), requiring the model to learn spatial and count relationships from visual data.

Furthermore, the ambiguity of this type of label set supervision calls for diverse scene generation. For example, given the label set person, surfboard, sea, a corresponding scene could have multiple instances of each label (under plausible count relationships), positioned at various locations (under plausible spatial relationships). For instance, in the COCO dataset [21], there are 869 images in the training set that have the label set person, sea and surfboard. Figure 1 shows examples of multiple plausible images with this label set.

We propose LayoutVAE, a stochastic model capable of generating scene layouts given a label set. The proposed framework can be easily embedded into existing scene generation models that take a scene layout as input, such as [10, 31], providing them plausible and diverse layouts. Our main contributions are as follows.

• We propose a new model for stochastic scene layout generation given a label set input. Our model has two components, one to model the distributions of count relationships between objects and another to model the distributions of spatial relationships between objects.

• We propose a new synthetic dataset, MNIST-Layouts, that captures the stochastic nature of the scene layout generation problem.

• We experimentally validate our model using MNIST-Layouts and the COCO [21] dataset, which contains complex real world scene layouts. We analyze our model and show that it can be used to detect unlikely scene layouts.

2. Related Work

Sentence-conditioned image generation. A variety of models have been proposed to generate an image given a sentence. Reed et al. [25] use a GAN [7] that is conditioned on a text encoding for generating images. Zhang et al. [30] propose a GAN based image generation framework where the image is progressively generated in two stages at increasing resolutions. Reed et al. [24] perform image generation with sentence input along with additional information in the form of keypoints or bounding boxes.

Hong et al. [11] break down the process of generating an image from a sentence into multiple stages. The input sentence is first used to predict the objects that are present in the scene, followed by prediction of bounding boxes, then semantic segmentation masks, and finally the image.

While scene layout generation in this work predicts probability distributions for the bounding box layout, it fails to model the stochasticity intrinsic in predicting each bounding box. Gupta et al. [8] use an approach similar to [11] to predict layouts for generating videos from scripts. Johnson et al. [12] use the scene graph generated from the input sentence as input to the image generation model. Given a scene graph, their model can generate only one scene layout.

Deng et al. [6] propose PNP-Net, a VAE framework to generate an image of an abstract scene from a text based program that fully describes it. While PNP-Net is a stochastic model for generation, it was tested only on synthetic datasets with a small number of classes. Furthermore, it tries to encode the entire image into a single latent code, whereas in LayoutVAE we break down just the layout generation step into two stages with multiple steps in each stage. For these reasons, it is unclear whether PNP-Net can scale up to real world image datasets with a large number of classes. Xu et al. [29] propose a GAN based model with attention for sentence-to-image generation. The more recent work from Li et al. [19] follows a multi-stage approach similar to [11] to generate an image from a sentence, with the key novelty of using attention mechanisms to create more realistic objects in the image.

Layout generation in other contexts. Chang et al. [3] propose a method for 3D indoor scene generation based on text description by placing objects from a 3D object library, which is later improved in [2] by learning the grounding of more detailed text descriptions with 3D objects. Wang et al. [28] use a convolutional network that iteratively generates a 3D room scene by adding one object at a time. Qi et al. [23] propose a spatial And-Or graph to represent indoor scenes, from which new scenes can be sampled. Differently from most other works, they use human affordances and activity information with respect to objects in the scene to model probable spatial layouts. Li et al. [18] propose a VAE based framework that encodes object and layout information of indoor 3D scenes in a latent code. During generation, the latent code is recursively decoded to obtain details of individual objects and their layout.

More recently, Li et al. [17] proposed LayoutGAN, a GAN based model that generates layouts of graphic elements (rectangles, triangles, etc.). While this work is close to ours in terms of the problem focus, LayoutGAN generates label sets based on input noise, and it cannot generate a layout for a given set of labels.

Placing objects in scenes. Lee et al. [16] propose a conditional GAN model for the problem of adding the segmentation mask of a new object to the semantic segmentation of an image. Lin et al. [20] address a similar problem of adding an RGB mask of an object into a background image.

3. Background

In this section, we first define the problem of scene layout generation from a label set, and then provide an overview of the base models that LayoutVAE is built upon.

3.1. Problem Setup

We are interested in modeling contextual relationships between objects in scenes, and furthermore in generating diverse yet plausible scene layouts given a label set as input. The problem can be formulated as follows.

Given a collection of M object categories, we represent the label set corresponding to an image in the dataset as L ⊆ {1, 2, 3, ..., M}, which indicates the categories present in the image. Note that here we use the word "object" in its very general form: car, cat, person, sky and water are all objects. For each label k ∈ L, let n_k be the number of objects of that label in the image and B_k = {b_{k,1}, b_{k,2}, ..., b_{k,n_k}} be the set of bounding boxes, where b_{k,i} = [x_{k,i}, y_{k,i}, w_{k,i}, h_{k,i}] represents the top-left coordinates, width and height of the i-th bounding box of category k. We train a generative model to predict diverse yet plausible sets of {B_k : k ∈ L} given the label set L as input.
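To make the notation concrete, the following is a minimal sketch (in Python, with hypothetical category ids and coordinates) of the layout representation described above:

```python
# Hypothetical layout for one image, following the notation above.
# Category ids index into the M object categories; each box is
# [x, y, w, h]: top-left coordinates, width and height.
label_set = {1, 5, 7}                    # L ⊆ {1, ..., M}
boxes = {
    1: [[0.10, 0.20, 0.15, 0.40],        # B_1: two instances of category 1
        [0.55, 0.25, 0.12, 0.35]],
    5: [[0.30, 0.60, 0.40, 0.10]],       # B_5: one instance
    7: [[0.00, 0.00, 1.00, 0.30]],       # B_7: one instance
}
counts = {k: len(v) for k, v in boxes.items()}   # {n_k : k ∈ L}
```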

3.2. Base Models

Variational Autoencoders. A variational autoencoder (VAE) [15] describes an instance of a family of generative models p_θ(x, z) = p_θ(x|z)p_θ(z) with a complex likelihood function p_θ(x|z) and an amortized inference network q_φ(z|x) to approximate the true posterior p_θ(z|x). Here x represents observable data examples, z the latent codes, θ the generative model parameters, and φ the inference network parameters. To prevent the latent variable z from just copying x, we force q_φ(z|x) to be close to the prior distribution p_θ(z) using a KL-divergence term. Usually in VAE models, p_θ(z) is a fixed Gaussian N(0, I). Both the generative and the inference networks are realized as non-linear neural networks. An evidence lower bound (ELBO) L(x; θ, φ) on the data log-likelihood log p(x) is used to jointly optimize θ and φ:

L(x; θ, φ) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p_θ(z))    (1)

Conditional VAEs. A conditional VAE (CVAE) [26] defines an extension of the VAE that conditions on an auxiliary description c of the data. The auxiliary conditional variable makes it possible to infer a conditional posterior q_φ(z|x, c) as well as perform generation p_θ(x|z, c) based on a given description c. The ELBO is thus updated as:

L_CVAE(x, c; θ, φ) = E_{q_φ(z|x,c)}[log p_θ(x|z, c)] − KL(q_φ(z|x, c) || p_θ(z|c))    (2)

In CVAE models, the prior of the latent variables z is modulated by the auxiliary input c.
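As a concrete reference, here is a minimal PyTorch sketch of the conditional bound in Equation 2 with diagonal Gaussian posterior and prior; the module names and sizes are illustrative, not the architecture used in this paper. Dropping the conditioning input c and fixing the prior to N(0, I) recovers the plain VAE bound of Equation 1.

```python
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    """Illustrative CVAE: Gaussian posterior q(z|x,c), Gaussian prior p(z|c)."""
    def __init__(self, x_dim=4, c_dim=16, z_dim=8):
        super().__init__()
        self.post = nn.Linear(x_dim + c_dim, 2 * z_dim)  # -> (mu, logvar)
        self.prior = nn.Linear(c_dim, 2 * z_dim)
        self.dec = nn.Linear(z_dim + c_dim, x_dim)       # mean of p(x|z,c)

    def elbo(self, x, c):
        mu_q, logvar_q = self.post(torch.cat([x, c], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(c).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterize
        # Gaussian log-likelihood with fixed unit variance (illustrative choice).
        recon = -0.5 * ((x - self.dec(torch.cat([z, c], -1))) ** 2).sum(-1)
        # Analytic KL between two diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1)
        return recon - kl  # maximize this
```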

4. LayoutVAE for Stochastic Scene Layout Generation

In this section, we present LayoutVAE and describe the scene layout generation process given a label set. As discussed in Section 1, this task is challenging and solving it requires answering the following two questions: what is the number of objects for each category? and what is the location and size of each object? LayoutVAE is naturally decomposed into two models: one to predict the count for each given label, named CountVAE, and another to predict the location and size of each object, named BBoxVAE. The overall structure of the proposed LayoutVAE is shown in Figure 2. The number of objects (count) for each label is first predicted by CountVAE, then BBoxVAE predicts the bounding box for each object. The two-step approach with stochastic models naturally allows LayoutVAE to generate diverse layouts. In addition, it provides the flexibility to handle various types of input, as it allows us to use each module independently. For example, BBoxVAE can be used to generate a layout if counts are available, or to add a single bounding box to an existing image given a new label.

The input to CountVAE is the set of labels L, and it predicts the distributions of object counts {n_k : k ∈ L} autoregressively, where n_k is the object count for category k. The input of BBoxVAE is the set of labels along with the counts for each label {n_k : k ∈ L}, and it predicts the distribution of each bounding box b_{k,i} autoregressively.

4.1. CountVAE

CountVAE is an instance of a conditional VAE designed to predict the conditional count distribution for the labels in an autoregressive manner. We use a predefined order for the label set (we observe empirically that a predefined order is superior to a randomized order across samples; learning an order is a potential extension but adds complexity). In practice, CountVAE predicts the count of the first label, then the count of the second label, etc., at each step conditioned on the already predicted counts. It models the distribution of the count n_k given the label set L, the current label k and the counts for each category that was predicted before, {n_m : m < k}. The conditioning input for CountVAE is:

c_k^c = (L, k, {n_m : m < k})    (3)

where (·, ·) denotes a tuple. We use the superscript c to indicate that a quantity is related to the CountVAE. We use a Poisson distribution to model the number of occurrences of the current label at each step:

p_{θ_c}(n_k | z_k^c, c_k^c) = λ(z_k^c, c_k^c)^{n_k − 1} e^{−λ(z_k^c, c_k^c)} / (n_k − 1)!    (4)

where p_{θ_c}(n_k | z_k^c, c_k^c) is a probability distribution over the random variable n_k, θ_c represents the generative model parameters of CountVAE, and λ(z_k^c, c_k^c) is the rate, which depends on the latent variable sample z_k^c and the conditional input c_k^c. Note that we learn the distribution over n_k − 1 since the count per label is always 1 or greater in this problem setting.

Figure 2: Model. LayoutVAE is composed of two models: CountVAE, which predicts the number of objects for each category, and BBoxVAE, which predicts the bounding box of each object. A graphical model is given in Appendix A of the supplementary.

Algorithm 1 CountVAE: Loss computation on the image

Input: Label set L, instance counts {n_k : k ∈ L}
1: L_Count = 0
2: for k ∈ L do
3:   Compute the conditioning input c_k^c for category k (Equation 3)
4:   Compute the variational lower bound L^c(n_k, c_k^c; θ_c, φ_c) (Equation 7)
5:   L_Count = L_Count + L^c(n_k, c_k^c; θ_c, φ_c)
6: end for
Output: L_Count / |L|
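For reference, one step of this loss (lines 3-5 of Algorithm 1) can be sketched in PyTorch as follows; posterior_net, prior_net and rate_net are placeholders for the modules described in this section, not our exact architecture:

```python
import torch
from torch.distributions import Poisson, Normal, kl_divergence

def count_step_elbo(n_k, c_k, posterior_net, prior_net, rate_net):
    """ELBO for one label count (Equation 7) with the shifted Poisson
    likelihood of Equation 4, modeled over n_k - 1 since counts are >= 1."""
    mu_q, logvar_q = posterior_net(n_k, c_k)        # q(z | n_k, c_k), Equation 5
    mu_p, logvar_p = prior_net(c_k)                 # p(z | c_k), Equation 6
    q = Normal(mu_q, (0.5 * logvar_q).exp())
    p = Normal(mu_p, (0.5 * logvar_p).exp())
    z = q.rsample()                                 # reparameterized sample
    rate = rate_net(z, c_k)                         # λ(z, c_k) > 0
    log_lik = Poisson(rate).log_prob(n_k - 1.0)     # shifted Poisson log-likelihood
    return log_lik - kl_divergence(q, p).sum(-1)    # maximize this
```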


The latent variable is sampled from the approximate posterior during learning and from the conditional prior during generation. The latent variables model the ambiguity in the scene layout. Both the approximate posterior and the prior are modeled as multivariate Gaussian distributions with diagonal covariance, with parameters as shown below:

q_{φ_c}(z_k^c | n_k, c_k^c) = N(μ_{φ_c}(n_k, c_k^c), σ²_{φ_c}(n_k, c_k^c))    (5)

p_{θ_c}(z_k^c | c_k^c) = N(μ_{θ_c}(c_k^c), σ²_{θ_c}(c_k^c))    (6)

where μ_{φ_c} (resp. μ_{θ_c}) and σ²_{φ_c} (resp. σ²_{θ_c}) are functions that estimate the mean and the variance of the approximate posterior (resp. prior). Details of the model architecture are given in subsection B.1 of the supplementary.

Learning. The model is optimized by maximizing the ELBO over {n_k : k ∈ L}. The ELBO corresponding to the label count n_k is given by

L^c(n_k, c_k^c; θ_c, φ_c) = E_{q_{φ_c}(z_k^c | n_k, c_k^c)}[log p_{θ_c}(n_k | z_k^c, c_k^c)] − KL(q_{φ_c}(z_k^c | n_k, c_k^c) || p_{θ_c}(z_k^c | c_k^c))    (7)

where φ_c represents the inference model parameters of CountVAE. The log-likelihood of the ground truth count under the predicted Poisson distribution is used to compute p_{θ_c}(n_k | z_k^c, c_k^c), while the KL divergence between the two Gaussian distributions is computed analytically. The computation of the loss for the label set L is given in Algorithm 1.

Generation. Given a label set L, CountVAE autoregressively predicts the object count for each category by sampling from the count distribution (Equation 4). We now present the generation process to predict the object count of category k. We first compute the conditional input c_k^c:

c_k^c = (L, k, {n_m : m < k})    (8)

where n_m is the predicted instance count for category m. Note that to predict the instance count of category k, the model exploits the predicted counts n_m for previous categories m < k to get more consistent counts. Then, we sample a latent variable z_k^c from the conditional prior:

z_k^c ∼ p_{θ_c}(z_k^c | c_k^c)    (9)

Finally, the count is sampled from the predicted Poisson count distribution:

n_k ∼ p_{θ_c}(n_k | z_k^c, c_k^c)    (10)

This label count is further used in the conditioning variable for the future steps of CountVAE.
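Putting Equations 8-10 together, count generation might look like the following sketch (cond_net, prior_net and rate_net are stand-ins for the actual modules):

```python
import torch
from torch.distributions import Poisson, Normal

def generate_counts(label_set, cond_net, prior_net, rate_net):
    """Autoregressively sample one count per label (Equations 8-10)."""
    counts = {}
    for k in sorted(label_set):                  # the predefined label order
        c_k = cond_net(label_set, k, counts)     # conditioning input, Equation 8
        mu_p, logvar_p = prior_net(c_k)
        z = Normal(mu_p, (0.5 * logvar_p).exp()).sample()  # Equation 9
        n_minus_1 = Poisson(rate_net(z, c_k)).sample()     # Equation 10
        counts[k] = int(n_minus_1.item()) + 1    # undo the shift: counts are >= 1
    return counts
```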

Algorithm 2 BBoxVAE: Loss computation on the image

Input: Label set L, instance counts {n_k : k ∈ L}, set of bounding boxes {B_k : k ∈ L}
1: L_BBox = 0
2: for k ∈ L do
3:   for j ∈ {1, ..., n_k} do
4:     Compute the conditioning input c_{k,j}^b for the j-th bounding box of category k (Equation 11)
5:     Compute the variational lower bound L^b(b_{k,j}, c_{k,j}^b; θ_b, φ_b) (Equation 13)
6:     L_BBox = L_BBox + L^b(b_{k,j}, c_{k,j}^b; θ_b, φ_b)
7:   end for
8: end for
Output: L_BBox / (Σ_{k∈L} n_k)

4.2. BBoxVAE

Given the label set L and the object count per category {n_m : m ∈ L}, BBoxVAE predicts the distribution of coordinates for the bounding boxes autoregressively. We follow the same predefined label order as CountVAE in the label space, and order the bounding boxes from left to right for each label. All bounding boxes for a given label are predicted before moving on to the next label. BBoxVAE is a conditional VAE that models the j-th bounding box b_{k,j} for the label k given the label set L along with the count for each label, the current label k, and the coordinate and label information of all the bounding boxes that were predicted earlier. Previous predictions include all the bounding boxes for previous labels as well as the ones previously predicted for the current label: B_{k,j}^{prev} = {b_{m,:} : m < k} ∪ {b_{k,i} : i < j}. The conditioning input of BBoxVAE is:

c_{k,j}^b = ({n_m : m ∈ L}, k, B_{k,j}^{prev})    (11)

We use the superscript b to indicate that a quantity is related to the BBoxVAE. We model bounding box coordinate and size information using a quadrivariate Gaussian distribution as shown in Equation 12, omitting the subscript k,j for all variables in the equation for brevity:

p_{θ_b}(x, y, w, h | z^b, c^b) = N(x, y, w, h | μ^b(z^b, c^b), σ^b(z^b, c^b))    (12)

where θ_b represents the generative model parameters of BBoxVAE, z^b denotes a sampled latent variable, and μ^b and σ^b are functions that estimate the mean and the variance respectively for the Gaussian distribution. x (resp. y) is the x- (resp. y-) coordinate of the top left corner and w (resp. h) is the width (resp. height) of the bounding box. These variables are normalized between 0 and 1 to be independent of the dimensions of the image. Details of the model architecture are given in subsection B.2 of the supplementary.

Figure 3: Samples from the MNIST-Layouts dataset. We randomly sample MNIST digits of the appropriate labels to fill in the bounding boxes. The rules for count and spatial layout (e.g., large 2s in the middle, small 3s at the bottom, etc.) are described in subsection C.1 in the supplementary.

Learning. We train the model in a manner analogous to CountVAE, by maximizing the variational lower bound over the entire set of bounding boxes. For the bounding box b_{k,j}, the lower bound is given by (omitting the subscript k,j for all variables in the equation):

L^b(b, c^b; θ_b, φ_b) = E_{q_{φ_b}(z^b | b, c^b)}[log p_{θ_b}(b | z^b, c^b)] − KL(q_{φ_b}(z^b | b, c^b) || p_{θ_b}(z^b | c^b))    (13)

where φ_b represents the inference model parameters of BBoxVAE. The computation of the loss for all the bounding boxes is given in Algorithm 2.

Generation. Generation for BBoxVAE is performed in an analogous manner to CountVAE. Given a label set along with the number of instances of each label, BBoxVAE autoregressively predicts bounding box distributions (Equation 12) by sampling a latent variable from the conditional prior p_{θ_b}(z^b | c^b). The conditioning variable is updated after each step by including a sampled bounding box from the current step's prediction.
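End to end, the two stages chain as in the following schematic sketch; the count_model and bbox_model interfaces (generate, condition, sample_prior, sample_box) are hypothetical names standing in for the modules defined above:

```python
def generate_layout(label_set, count_model, bbox_model):
    """Two-stage LayoutVAE generation: counts first, then boxes
    left to right within each label, in the predefined label order."""
    counts = count_model.generate(label_set)              # CountVAE stage
    layout, history = {}, []
    for k in sorted(label_set):
        layout[k] = []
        for j in range(counts[k]):
            c = bbox_model.condition(counts, k, history)  # Equation 11
            z = bbox_model.sample_prior(c)                # z ~ p(z | c)
            box = bbox_model.sample_box(z, c)             # from Equation 12
            layout[k].append(box)
            history.append((k, box))                      # conditions later steps
    return layout
```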

5. Experiments

We implemented LayoutVAE in PyTorch, and used the Adam optimizer [14] for training the models. CountVAE was trained for 50 epochs at a learning rate of 10^−5 with batch size 32. BBoxVAE was trained for 150 epochs at a learning rate of 10^−4 with batch size 32. We chose the best model during training based on the validation loss, and evaluate that model's performance on the test set. We present quantitative and qualitative results that showcase the usefulness of the LayoutVAE model.
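The optimization setup above corresponds to something like the following sketch (the nn.Linear placeholders stand in for the actual CountVAE and BBoxVAE modules):

```python
import torch
import torch.nn as nn

count_vae = nn.Linear(8, 8)   # placeholder for the CountVAE module
bbox_vae = nn.Linear(8, 8)    # placeholder for the BBoxVAE module

# Section 5 hyperparameters: Adam, batch size 32;
# CountVAE: 50 epochs at lr 1e-5, BBoxVAE: 150 epochs at lr 1e-4.
count_opt = torch.optim.Adam(count_vae.parameters(), lr=1e-5)
bbox_opt = torch.optim.Adam(bbox_vae.parameters(), lr=1e-4)
```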

5.1. Datasets

MNIST-Layouts. We created this dataset by placing multiple MNIST digits on a 128×128 canvas based on predefined rules detailed in subsection C.1 in the supplementary material. We use the digits {1, 2, 3, 4} as the global set of labels, thus limiting the maximum number of labels per image to 4. The dataset consists of 5000 training images, and 1000 images each for validation and testing. Figure 3 shows some examples of scene layouts from the training set.

Model | Count Accuracy (%) | Count Accuracy within ±1 (%) | BBox Mean IoU
AR-MLP | 74.21 | 87.50 | 0.18
BBoxLSTM | - | - | 0.17
BLSTM | 73.98 | 87.87 | 0.15
sg2im [12] | - | - | 0.14
LayoutVAE | 78.38 | 89.87 | 0.20

Table 1: Comparison with baselines on the COCO dataset using accuracy metrics. We use the most likely count or bounding box as the prediction for all the models. While all models predict a distribution as output, only LayoutVAE has a stochastic latent code. The mean of the latent code distribution is used to obtain the output (count or bounding box) distribution for measuring the accuracy of LayoutVAE.

COCO. We use the 2017 Panoptic version of the COCO dataset [21] for our experiments. This dataset has images of complex everyday scenes with bounding box annotations for all major things and stuffs present in the scene. The official release has 118,267 annotated images in the training set and 5000 in the validation set. To allow a comparison with future methods, we use the official validation set as the test set, and use the last 5000 images of the training set as the validation set in our experiments. The dataset has 80 thing categories (person, table, chair, etc.) and 53 stuff categories (sky, sand, snow, etc.). The largest number of labels present in a label set is 34, and the largest number of bounding box instances present in an image is 98. We normalize all bounding box dimensions to [0, 1] by dividing by the image size (width or height, as appropriate). This allows the model to predict layouts for square images of any size. We ignore thing instances that are tagged "iscrowd", i.e., those consisting of a group of thing instances; they account for less than 1 percent of all the annotations.
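The preprocessing just described amounts to roughly the following sketch; the annotation fields ("bbox", "iscrowd", "category_id") follow the standard COCO annotation format:

```python
def preprocess(annotations, img_w, img_h):
    """Normalize boxes to [0, 1] and drop 'iscrowd' group instances."""
    layout = {}
    for ann in annotations:
        if ann.get("iscrowd", 0) == 1:     # groups of things; < 1% of annotations
            continue
        x, y, w, h = ann["bbox"]           # COCO boxes are [x, y, w, h] in pixels
        box = [x / img_w, y / img_h, w / img_w, h / img_h]
        layout.setdefault(ann["category_id"], []).append(box)
    return layout
```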

5.2. Baseline Models

To our knowledge, the problem of generating scene layouts (with both thing and stuff categories, and multiple bounding boxes per category covering almost the entire image) from a label set is new and has not been studied. There is no existing model for this task, so we adapt existing models that generate scene layouts from more complex inputs, e.g., sentences or scene graphs. Note that we choose the same distributions as in LayoutVAE for modeling the outputs (counts or bounding box information) to have a fair comparison between all the models.

Autoregressive MLP (AR-MLP). This model is analogous to our proposed VAE based model, except that it has a multi-layer perceptron instead of a VAE. It runs autoregressively and takes as input the same conditioning information used in the VAE models. As in LayoutVAE, we have two separate sub-models: CountMLP for predicting the count distribution given the corresponding conditioning information, and BBoxMLP for predicting the bounding box distribution given the corresponding conditioning information.

Model | Count NLL (MNIST) | Count NLL (COCO) | BBox NLL (MNIST) | BBox NLL (COCO)
AR-MLP | 1.246 | 0.740 | 5.83 | 40.91
BBoxLSTM | - | - | 6.21 | 42.42
BLSTM | 1.246 | 0.732 | 20.06 | 52.84
sg2im [12] | - | - | - | 214.03
LayoutVAE | 1.246 | 0.562 | −0.07 | 2.72

Table 2: Comparison with baselines using the likelihood metric. Negative log-likelihood (lower is better) over the test set on the MNIST-Layouts and COCO datasets.

BBoxLSTM. This model is analogous to the bounding box predictor used in Hong et al. [11]. It consists of an LSTM that takes in the label set embedding (as opposed to the sentence embedding in [11]) at each step and predicts the label and bounding box distributions for the bounding boxes in the image one by one. At each step, the LSTM output is first used to generate the label distribution. The bounding box distribution is then generated conditioned on the label.

BLSTM. We also use a bidirectional LSTM in an analogous fashion to LayoutVAE. We have CountBLSTM and BBoxBLSTM to predict the count distribution and the bounding box distribution respectively, where we input the respective conditioning information at each step of the BLSTM. The conditioning information is similar to that in the VAE models, except that we do not give the pooled embedding of previous counts/bounding boxes since we now have a bidirectional recurrent model. Note that having a bidirectional model can possibly alleviate the need to explicitly define the order in the label and bounding box coordinate spaces.

sg2im [12]. This model generates a scene layout based on a scene graph. We compare LayoutVAE with this model because a label set (in this case, with multiple copies of labels as per the ground truth count of each label) can be seen as the simplest scene graph, without any relations. We also explored another strategy where a scene graph is randomly generated for the label set, but we found that model performance worsened in that setting. For this experiment, we use the code and the pretrained model on COCO provided by the authors at https://github.com/google/sg2im to predict the bounding boxes.

Note that these models are limited in their ability to model complex distributions and generate diverse layouts, whereas LayoutVAE can do so using its stochastic latent variable.

History | Context | CountVAE NLL | BBoxVAE NLL
 | | 0.592 | 4.17
 | ✓ | 0.570 | 3.78
✓ | | 0.581 | 2.93
✓ | ✓ | 0.562 | 2.72

Table 3: Ablation study. Test NLL results for CountVAE and BBoxVAE by prior sampling on the COCO dataset. The history is the previous counts for CountVAE and the previous bounding box information for BBoxVAE. The context is the label set for CountVAE and the label set with counts for BBoxVAE.

things before stuffs | stuffs before things | random
2.72 | 2.71 | 3.22

Table 4: Analysis of label order. NLL for BBoxVAE for different label set orders on the COCO dataset.

5.3. Quantitative Evaluation

To compare the models, we compute the average negative log-likelihood (NLL) of the test samples' counts or bounding box coordinates under the respective predicted distributions. We compute a Monte Carlo estimate of the NLL for the VAE models by drawing 1000 samples from the conditional prior at each step of generation. LayoutVAE and the baselines are trained and evaluated by teacher forcing, i.e., we provide the ground truth context and count/bounding box history at each step of computation.
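The Monte Carlo NLL estimate can be sketched as a log-mean-exp over prior samples; prior_net and decoder are placeholders for the modules of either VAE:

```python
import math
import torch

def mc_nll(x, c, prior_net, decoder, n_samples=1000):
    """Estimate -log p(x|c) ≈ -log( (1/S) Σ_s p(x|z_s, c) ), z_s ~ p(z|c)."""
    mu_p, logvar_p = prior_net(c)
    log_probs = []
    for _ in range(n_samples):
        z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
        log_probs.append(decoder(z, c).log_prob(x))      # log p(x | z, c)
    stacked = torch.stack(log_probs)                     # [S, ...]
    return -(torch.logsumexp(stacked, dim=0) - math.log(n_samples))
```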

Comparison with baseline models. Table 1 (accuracy metrics) and Table 2 (likelihood) show count and bounding box generation performance for all the models. First, we observe that the LayoutVAE model is significantly better than all the baselines. In particular, the large improvement with respect to the autoregressive MLP baseline shows the relevance of a stochastic model for this task. It is also interesting to note that the autoregressive MLP model performs better than the recurrent models. Finally, we observe that the sg2im model [12] is not able to predict accurate bounding boxes without a scene graph with relationships. LayoutVAE and the baselines show similar performance for count prediction on MNIST-Layouts because estimating the count distribution over the 4 labels of MNIST-Layouts is much easier than on the real world data from COCO.

Ablation study. In Table 3, we analyze the importance of the context representation and the history in the conditioning information. We observe that both history and context are useful because they increase the log-likelihood, but their effects vary for the count and bounding box models. The context is more critical than the history for CountVAE, whereas the history is more critical than the context for BBoxVAE.

Figure 4: Example where the likelihood under LayoutVAE increases when flipped upside down (panels: original image, original layout, flipped layout). We can see that the flipped layout is equally plausible for this example.

Figure 5: Image generation from a label set. We show diverse realistic layouts generated by LayoutVAE for the input label set person, surfboard and sea. We generate images from the layouts by using the generative model provided by Zhao et al. [31].

Despite these different behaviours, we note that history and context are complementary and increase the performance of both CountVAE and BBoxVAE. This experiment validates the importance of both context and history in the conditioning information for both CountVAE and BBoxVAE.

Analysis of the label set order. We performed experiments changing the order of labels in the label set. We consider three variants for this experiment: a fixed order with all the things before stuffs (the default choice for all the other experiments), the reverse order with all the stuffs before things, and finally a randomly assigned order of labels for each image. Table 4 shows the results of the experiment on BBoxVAE. We found that the fixed orders (things first or stuffs first) gave similar results, while randomizing the order of labels across images resulted in a significant reduction in performance.

Detecting unlikely layouts in COCO. To test the ability of the model to differentiate between plausible and unlikely layouts, we perform an experiment where we flip the image layouts in the test set upside down, and evaluate the NLL (in this case, by importance sampling using 1000 samples from the posterior) of both the normal layout and the flipped layout under BBoxVAE. We found that the NLL got worse for 92.58% of image layouts in the test set when flipped upside down. The average NLL for the original layouts is 2.26, while that for the flipped layouts is 4.33. We note that there are instances where flipping does not necessarily lead to an anomalous layout, which explains why the likelihood worsened for only 92.58% of the test set layouts. Figure 4 shows such an example where flipping led to an increased likelihood under the model. Additional results are shown in subsection C.3 in the supplementary material.

Figure 6: Stochastic layout generation for MNIST-Layouts, for input label sets (a) {1, 2, 3, 4}, (b) {2, 3, 4}, (c) {3, 4} and (d) {1, 3}. We show 5 sample layouts generated by LayoutVAE for each input set of labels. We fill in the generated bounding boxes with randomly sampled MNIST digits of the appropriate labels. We can see that LayoutVAE generates bounding boxes following all the rules that we defined for MNIST-Layouts (e.g., large 2s in the middle, small 3s at the bottom, etc., with the appropriate count values). The complete set of rules is given in subsection C.1 in the supplementary material.

Figure 7: Bounding box prediction using LayoutVAE (steps: person, person, sports ball, baseball glove, playingfield, tree-merged). We show the steps of bounding box generation for a test set sample. At each step, we obtain diverse bounding box predictions for the input label (written in the subcaptions) given the ground truth layout up to the previous step (shown as cyan bounding boxes). We show a bounding box heatmap using 100 samples from the prior distribution (top row for each step, red means high probability), and bounding boxes using 20 samples from the prior distribution. The predicted layouts are overlaid on the test image for clarity.


5.4. Qualitative Evaluation

Scene layout samples. We present qualitative results for diverse layout generation on the COCO (Figure 5) and MNIST-Layouts (Figure 6) datasets. The advantages of modeling scene layout generation with a probabilistic model are evident from these results: LayoutVAE learns the rules of object layout in the scene and is capable of generating diverse layouts with different counts of objects. More examples can be found in subsection C.4 in the supplementary material.

Per-step bounding box prediction for COCO. Figure 7 shows the steps of bounding box generation for a test set of labels. For each step, we show the probability map and 20 plausible bounding boxes. We observe that, given some bounding boxes, the model is able to predict where the next object could be in the image. More examples of bounding box generation are provided in subsection C.5 in the supplementary material.

6. Conclusion

We propose LayoutVAE for generating stochastic scene layouts given a label set as input. It comprises two components that model the distributions of spatial and count relationships between objects in a scene. We compared it with other existing methods or analogues thereof, and showed significant performance improvements. Qualitative visualizations show that LayoutVAE can learn intrinsic relationships between objects in real world scenes. Furthermore, LayoutVAE can be used to detect abnormal layouts.

References

[1] Navaneeth Bodla, Gang Hua, and Rama Chellappa. Semi-supervised FusedGAN for Conditional Image Generation. In European Conference on Computer Vision (ECCV), 2018.
[2] Angel Chang, Will Monroe, Manolis Savva, Christopher Potts, and Christopher D. Manning. Text to 3D Scene Generation with Rich Lexical Grounding. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2015.
[3] Angel X Chang, Manolis Savva, and Christopher D Manning. Learning Spatial Knowledge for Text to 3D Scene Generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[4] Qifeng Chen and Vladlen Koltun. Photographic Image Synthesis With Cascaded Refinement Networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[5] Myung Jin Choi, Joseph J Lim, Antonio Torralba, and Alan S Willsky. Exploiting hierarchical context on a large database of object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[6] Zhiwei Deng, Jiacheng Chen, Yifang Fu, and Greg Mori. Probabilistic Neural Programmed Networks for Scene Generation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
[8] Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine This! Scripts to Compositions to Videos. In European Conference on Computer Vision (ECCV), 2018.
[9] Yang He, Bernt Schiele, and Mario Fritz. Diverse Conditional Image Generation by Stochastic Regression with Latent Drop-Out Codes. In European Conference on Computer Vision (ECCV), 2018.
[10] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Generating Multiple Objects at Spatially Distinct Locations. In International Conference on Learning Representations (ICLR), 2019.
[11] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[12] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image Generation from Scene Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[13] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations (ICLR), 2018.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.
[15] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[16] Donghoon Lee, Ming-Yu Liu, Ming-Hsuan Yang, Sifei Liu, Jinwei Gu, and Jan Kautz. Context-Aware Synthesis and Placement of Object Instances. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[17] Jianan Li, Tingfa Xu, Jianming Zhang, Aaron Hertzmann, and Jimei Yang. LayoutGAN: Generating Graphic Layouts with Wireframe Discriminator. In International Conference on Learning Representations (ICLR), 2019.
[18] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri, Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen, Daniel Cohen-Or, and Hao Zhang. GRAINS: Generative Recursive Autoencoders for Indoor Scenes. ACM Transactions on Graphics (TOG), 2018.
[19] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven Text-to-Image Synthesis via Adversarial Training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[20] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), 2014.
[22] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic Image Synthesis with Spatially-Adaptive Normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[23] Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, and Song-Chun Zhu. Human-centric Indoor Scene Synthesis Using Stochastic Grammar. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[24] Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning What and Where to Draw. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[25] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text-to-Image Synthesis. In International Conference on Machine Learning (ICML), 2016.
[26] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
[27] Erik B Sudderth, Antonio Torralba, William T Freeman, and Alan S Willsky. Learning hierarchical models of scenes, objects, and parts. In IEEE International Conference on Computer Vision (ICCV), 2005.
[28] Kai Wang, Manolis Savva, Angel X Chang, and Daniel Ritchie. Deep convolutional priors for indoor scene synthesis. ACM Transactions on Graphics (TOG), 2018.
[29] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[30] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[31] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image Generation from Layout. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

A. Graphical Model

For each of CountVAE and BBoxVAE, we have a conditional VAE that is autoregressive (Figure 15), where the conditioning variable contains all the information required for each step of generation. By explicitly designing the conditioning variable this way, we forgo using past latent codes for generation.

Figure 15: Graphical model for LayoutVAE. c denotes context (label set for CountVAE, label set with counts for BBoxVAE) and c_k denotes the conditioning information for the CVAE. The dashed arrow denotes inference of the approximate posterior.

B. Model Architecture

B.1. CountVAE

We represent the label set L by a multi-label vector s ∈ {0, 1}^M, where s_k = 1 (resp. 0) means the k-th category is present (resp. absent). For each step of CountVAE, the current label k is represented by a one-hot vector denoted as l_k ∈ {0, 1}^M. We represent the count information n_k of category k as an M-dimensional one-hot vector y_k with the non-zero location filled with the count value. This representation of the count captures its label information as well. We perform pooling over the set of previously predicted label counts by summing up the vectors. The conditioning input is

c_k^c = FC([MLP(s), MLP(l_k), MLP(Σ_{m<k} y_m)])    (27)

where [·, ·] denotes concatenation, FC a fully connected layer, and MLP a generic multi-layer perceptron. Figure 16 shows the architecture in detail.
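A sketch of Equation 27 in PyTorch follows; layer widths are illustrative, and M = 133 assumes the 80 thing plus 53 stuff categories of COCO:

```python
import torch
import torch.nn as nn

class CountCondEncoder(nn.Module):
    """Builds c_k^c from the label set vector s, the current label l_k,
    and the sum-pooled previous count vectors (Equation 27)."""
    def __init__(self, M=133, hidden=64):
        super().__init__()
        self.mlp_s = nn.Sequential(nn.Linear(M, hidden), nn.ReLU())
        self.mlp_l = nn.Sequential(nn.Linear(M, hidden), nn.ReLU())
        self.mlp_y = nn.Sequential(nn.Linear(M, hidden), nn.ReLU())
        self.fc = nn.Linear(3 * hidden, hidden)

    def forward(self, s, l_k, pooled_counts):
        # pooled_counts: [B, M] = Σ_{m<k} y_m (sum pooling of count vectors)
        parts = [self.mlp_s(s), self.mlp_l(l_k), self.mlp_y(pooled_counts)]
        return self.fc(torch.cat(parts, dim=-1))
```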

B.2. BBoxVAE

We represent the label and count information pair for each category {n_m : m ∈ L} as y_m using the same strategy as in CountVAE. A pooled representation of the label set along with counts is obtained by summing up these vectors to obtain a multi-label vector. For each step of BBoxVAE, the current label k is represented by a one-hot vector denoted as l_k. We use an LSTM for pooling the previously predicted bounding boxes B_{k,j}^{prev}. We represent each bounding box as a vector of size 4, and we concatenate the M-dimensional label vector to add label information to it. We pass the resulting M+4 dimensional vectors of successive bounding boxes through an LSTM and use the final step output as the pooled representation. Figure 18 shows the pooling operation, and Figure 17 shows the detailed architecture for each module in BBoxVAE. The conditioning input is

c_{k,j}^b = FC([MLP(Σ_{m∈L} y_m), MLP(l_k), MLP(B_{k,j}^{prev})])    (28)

BBoxVAE predicts the mean of the quadrivariate Gaussian (Equation 12), while the covariance is assumed to be a diagonal matrix with each value of the standard deviation equal to 0.02.
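The LSTM pooling of previously predicted boxes (Figure 18) might look like this sketch, again with illustrative sizes:

```python
import torch.nn as nn

class BoxLSTMPool(nn.Module):
    """Pools B^prev: each box [x, y, w, h] is concatenated with its M-dim
    label vector and fed through an LSTM; the final step output is the
    pooled representation used in the conditioning input (Equation 28)."""
    def __init__(self, M=133, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(M + 4, hidden, batch_first=True)

    def forward(self, prev_boxes):
        # prev_boxes: [B, T, M + 4] sequence of label-augmented boxes
        _, (h_n, _) = self.lstm(prev_boxes)
        return h_n[-1]  # [B, hidden], the final step output
```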

Figure 16: CountVAE Architecture.

Figure 17: BBoxVAE Architecture.

Figure 18: LSTM pooling for bounding boxes. We use an LSTM to pool the set of previously predicted bounding boxes to be used in the conditioning information for BBoxVAE.

C. Experiments

C.1. MNIST-Layouts dataset

Table 9 shows the rules used to generate the dataset. To generate the layouts, we adapted the code provided at https://github.com/aakhundov/tf-attend-infer-repeat.

Label | Count | Location | Size
1 | 3, 4 | top | medium
2 | 2, 3 | middle | large
3 | 1, 2 | bottom | small-medium
4 (2 present) | count(2)+3, 6 | around a 2 | small
4 (2 absent) | 2 | bottom-right | small

Table 9: Rules for generating the MNIST-Layouts dataset. Given a label set, we use a uniform distribution over the possible count values to generate the count. We then sample from a uniform distribution over the location and size ranges (precise details skipped in the table for brevity) to generate bounding boxes for each label instance. When label 4 is present in the input label set along with label 2 (4th row in the table), we randomly choose an instance of 2 and place all the 4s around it.
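As an illustration, sampling the per-label counts of Table 9 might look like the following sketch; we read the count entry for "4 (2 present)" as a uniform range from count(2)+3 to 6, which is our interpretation of the table:

```python
import random

def sample_counts(label_set):
    """Sample per-label counts following Table 9 (uniform over the ranges)."""
    counts = {}
    if 1 in label_set:
        counts[1] = random.choice([3, 4])
    if 2 in label_set:
        counts[2] = random.choice([2, 3])
    if 3 in label_set:
        counts[3] = random.choice([1, 2])
    if 4 in label_set:
        if 2 in label_set:                   # 4s are placed around a chosen 2
            counts[4] = random.randint(counts[2] + 3, 6)
        else:
            counts[4] = 2
    return counts
```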

C.2. Analysis of latent code size

In Table 10, we analyze the dimension of the latent space for both CountVAE and BBoxVAE. For each dimension, we report the NLL performance on the COCO dataset. We observe that both models perform well when the latent code size is between 32 and 128, and the models are not sensitive to this hyperparameter. Increasing the latent code size beyond 128 does not improve performance.

Latent Code Size | CountVAE | BBoxVAE
2 | 0.569 | 4.13
4 | 0.568 | 3.11
8 | 0.565 | 2.91
16 | 0.564 | 2.72
32 | 0.562 | 2.72
64 | 0.563 | 2.69
128 | 0.562 | 2.70

Table 10: Effect of latent code size. Average NLL over the COCO test set for CountVAE and BBoxVAE while varying the size of the latent code.

C.3. Detecting unlikely layouts

In this section, we present more examples from our experiment on detecting unlikely layouts by flipping the original layout and computing the likelihood under LayoutVAE. In Figure 19, we present some typical examples where the likelihood under LayoutVAE (BBoxVAE, to be precise, since CountVAE gives the same result for original and flipped layouts as the label counts remain the same) decreases when flipped upside down. This behaviour was observed for 92.58% of samples in the test set. In Figure 20, we present some examples of unusual layouts where the likelihood under LayoutVAE increases when flipped upside down.

Figure 19: Some examples where the likelihood under BBoxVAE decreases when flipped upside down (panels: image, image layout, flipped layout; original vs. flipped NLL: 1.81 vs. 11.69, 1.95 vs. 11.15, 2.15 vs. 11.26, 0.71 vs. 9.07). We show the test image, the layout for the image, and the flipped layout; the negative log-likelihood (NLL) of each layout under BBoxVAE is shown alongside it. We can see that the flipped layouts are highly unlikely in these examples.

Figure 20: Some examples where the likelihood under BBoxVAE increases when flipped upside down (panels: image, image layout, flipped layout; original vs. flipped NLL: 6.15 vs. 1.89, 3.47 vs. 2.16, 2.19 vs. 0.90, 3.72 vs. 2.62). We show the test image, the layout for the image, and the flipped layout; the negative log-likelihood (NLL) of each layout under BBoxVAE is shown alongside it. We can see that the flipped layouts are equally or sometimes more plausible in these examples.

C.4. Examples for Layout Generation

Figure 21 shows examples of diverse layouts generated using LayoutVAE.

Figure 21: Layout generation using LayoutVAE. We show 5 randomly sampled layouts for each of the input label sets {person, snow, snowboard}, {person, snow, skis}, {person, playingfield, tennis racket}, {bird, sea} and {person, tie}.

C.5. Examples for Bounding Box Generation

Figure 22 presents examples that showcase the ability of LayoutVAE to use conditioning information to predict plausible bounding boxes. We present additional examples of stochastic bounding box generation for test samples that have the labels person, surfboard and sea in Figure 23 and Figure 24.

Figure 22: Importance of conditioning information (steps, first example: person, person, sports ball, baseball glove, playingfield, tree-merged; second example: person at every step). We present two examples, both of which have person as their first label. We see that the bounding boxes for the first label person (first column) are close to the center in the first example, whereas they are smaller and close to the left side of the scene in the second example. This is because BBoxVAE gets the count of persons (2 for the first example, and 12 for the second) in its conditioning information, which prompts the model to predict bounding boxes appropriately.

Figure 23: Test examples with labels person, surfboard, sea (steps per example: person, person, surfboard, surfboard, sea, sky; person, person, surfboard, surfboard, sea, sky; person, person, person, surfboard, sea, sky). We show the steps of bounding box generation for test set samples in the same manner as in Figure 7 from the main paper.

Figure 24: An example with more objects in the scene (steps: person ×5, surfboard ×4, sand, sea, sky). We show the steps of bounding box generation for a test set sample in the same manner as in Figure 7 from the main paper.
