
Neural Design Network: Graphic Layout Generation with Constraints

Hsin-Ying Lee∗2, Weilong Yang1, Lu Jiang1, Madison Le1, Irfan Essa1,3, Haifeng Gong1, Ming-Hsuan Yang1,2

1Google Research 2University of California, Merced 3Georgia Institute of Technology

Figure 1: Graphic layout generation with user constraints. We present realistic use cases of the proposed model. Given the desired components and partial user-specified constraints among them, our model can generate layouts following these constraints. We also present example designs constructed based on the generated layouts.

Abstract

Graphic design is essential for visual communication, with layouts being fundamental to composing attractive designs. Layout generation differs from pixel-level image synthesis and is unique in terms of the requirement of mutual relations among the desired components. We propose a method for design layout generation that can satisfy user-specified constraints. The proposed neural design network (NDN) consists of three modules. The first module predicts a graph with complete relations from a graph with user-specified relations. The second module generates a layout from the predicted graph. Finally, the third module fine-tunes the predicted layout. Quantitative and qualitative experiments demonstrate that the generated layouts are visually similar to real design layouts. We also construct real designs based on predicted layouts for a better understanding of the visual quality. Finally, we demonstrate a practical application on layout recommendation.

∗ This work was done while Hsin-Ying Lee was an intern at Google, Inc.

1. Introduction

Graphic design surrounds us on a daily basis, from image advertisements, movie posters, and book covers to more functional presentation slides, websites, and mobile applications. Graphic design is a process of using text, images, and symbols to visually convey messages. Even for experienced graphic designers, the design process is iterative and time-consuming, with many false starts and dead ends. This is further exacerbated by the proliferation of platforms and users with significantly different visual requirements and desires.

In graphic design, layout – the placement and sizing of components (e.g., title, image, logo, banner) – plays a significant role in dictating the flow of the viewer's attention and, therefore, the order in which the information is received. Creating an effective layout requires understanding and balancing the complex and interdependent relationships amongst all of the visible components. Variations in the layout change the hierarchy and narrative of the message.

In this work, we focus on the layout generation problem that places components based on the component attributes, relationships among components, and user-specified constraints. Figure 1 illustrates examples where users specify a collection of assets and constraints; the model then generates design layouts that satisfy all input constraints while remaining visually appealing.

Generative models have seen success in rendering realistic natural images [6, 16, 27]. However, learning-based graphic layout generation remains less explored. Existing studies tackle layout generation based on templates [2, 11] or heuristic rules [25], and more recently using learning-based generation methods [15, 23, 32]. However, these approaches are limited in handling relationships among components. High-level concepts such as the mutual relationships of components in a layout are unlikely to be captured well by conventional generative models operating in pixel space. Moreover, using generative models to account for user preferences and constraints is non-trivial. Therefore, effective feature representations and learning approaches for graphic layout generation remain challenging.

In this work, we introduce the neural design network (NDN), a new approach for synthesizing a graphic design layout given a set of components with user-specified attributes and constraints. We employ directed graphs as our feature representation for components and constraints, since the attributes of components (nodes) and relations among components (edges) can be naturally encoded in a graph. NDN takes as input a graph constructed from the desired components as well as user-specified constraints, and outputs a layout in which the bounding boxes of all components are predicted.

NDN consists of three modules. First, the relation prediction module takes as input a graph with partial edges, representing components and user-specified constraints, and infers a graph with complete relationships among components. Second, in the layout generation module, the model predicts bounding boxes for components in the complete graph in an iterative manner. Finally, in the refinement module, the model further fine-tunes the bounding boxes to improve alignment and visual quality.

We evaluate the proposed method qualitatively and quantitatively on three datasets under various metrics to analyze the visual quality. The three experimental datasets are RICO [3, 24], Magazine [32], and an image banner advertisement dataset collected in this work. These datasets reasonably cover several typical applications of layout design, with common components such as images, texts, buttons, and toolbars, and relations such as above, larger, around, etc. We construct real designs based on the generated layouts to assess their quality. We also demonstrate the efficacy of the proposed model by introducing a practical layout recommendation application.

We make the following contributions in this work:
• We propose a new approach that can generate high-quality design layouts for a set of desired components and user-specified constraints.
• We validate that our method performs favorably against existing models in terms of realism, alignment, and visual quality on three datasets.
• We demonstrate real use cases that construct designs from generated layouts, as well as a layout recommendation application. Furthermore, we collect a real-world advertisement layout dataset to broaden the variety of existing layout benchmarks.

2. Related Work

Natural scene layout generation. Layout is often used as an intermediate representation in image generation tasks conditioned on text [8, 10, 31] or scene graphs [14]. Instead of directly learning the mapping from the source domain (e.g., text or scene graph) to the image domain, these methods model the operation as a two-stage framework: they first predict layouts conditioned on the input sources, and then generate images based on the predicted layouts. Recently, Jyothi et al. propose LayoutVAE [15], a generative framework that can synthesize scene layouts given a set of labels. However, a graphic design layout has several fundamental differences from a natural scene layout. The demands for relationship and alignment among components are strict in graphic design: a few pixels of offset can either change the visual experience or ruin the whole design. A graphic design layout does not only need to look realistic, but also needs to be considered from an aesthetic perspective.

Graphic design layout generation. Early work on design layout or document layout mostly relies on templates [2, 11], exemplars [20], or heuristic design rules [25, 30]. These methods depend on predefined templates and heuristic rules, which require professional knowledge to create. Therefore, they are limited in capturing complex design distributions. Other work leverages saliency maps [1] and attention mechanisms [26] to capture the visual importance of graphic designs and to trace the user's attention. Recently, generative models have been applied to graphic design layout generation [23, 32]. The LayoutGAN model [23] can generate layouts consisting of graphic elements such as rectangles and triangles. However, LayoutGAN generates layouts from input noise and cannot handle layout generation given a set of components with specified attributes, which is the common setting in graphic design. The layout generative network of [32] is a content-aware layout generation framework that can render layouts conditioned on attributes of components. While the goals are similar, a conventional GAN-based framework cannot explicitly model relationships among components and user-specified constraints.


Figure 2: Framework illustration. The neural design network consists of three modules: relation prediction, bounding box prediction, and refinement. We illustrate the process with a three-component example. In the relation prediction module, the model takes as input a graph with partial relations along with a latent vector (encoded from the graph with complete relations during training, sampled from a prior during testing), and outputs a graph with complete relations. Only the graph with location relations is shown in the figure for brevity. In the layout generation module, the model takes the graph with complete relations as input and predicts the bounding boxes of components in an iterative manner. In the refinement module, the model further fine-tunes the layout.

Graph neural networks in vision. Graph neural networks (GNNs) [5, 7, 29] aim to model dependence among nodes in a graph via message passing. GNNs are useful for data that can be formulated in a graph data structure. Recently, GNNs and related models have been applied to classification [19], scene graphs [14, 33], motion modeling [12], and molecular property prediction [4, 13], to name a few. In this work, we model a design layout as a graph and apply GNNs to capture the dependency among components.

3. Graphic Layout Generation

Our goal is to generate design layouts given a set of design components with user-specified constraints. For example, in image ad creation, a designer can input constraints such as "logo at bottom-middle of canvas", "call-to-action button of size (100px, 500px)", "call-to-action button is below logo", etc. The goal is to synthesize a set of design layouts that satisfy both the user-specified constraints and the common rules of image ad layouts. Unlike layout templates, these layouts are dynamically created and can serve as inspiration for designers.

We introduce the neural design network using graph neural networks and a conditional variational autoencoder (VAE) [18, 28], with the goal of capturing better representations of design layouts. Figure 2 illustrates the process of generating a three-component design with the proposed neural design network. In the rest of this section, we first describe the problem overview in Section 3.1. We then detail the three modules in NDN: the relation prediction module (Section 3.2), the layout generation module (Section 3.3), and the refinement module (Section 3.4).

3.1. Problem Overview

The inputs to our network are a set of design components and user-specified constraints. We model the inputs as a graph, where each design component is a node and their relationships are edges. In this paper, we study two common relationships between design components: location and size.

Define $G = \{G_{loc}, G_{size}\} = (O, E_{loc}, E_{size})$, where $O = \{o_0, o_1, \dots, o_n\}$ is a set of $n$ components with each $o_i \in C$ coming from a set of categories $C$. We use $o_0$ to denote the canvas, which is fixed in both location and size, and $o_i$ to denote the other design components that need to be placed on the canvas, such as a logo or button. $E_{loc} = \{l_1, \dots, l_{m_l}\}$ and $E_{size} = \{s_1, \dots, s_{m_s}\}$ are sets of directed edges with $l_k = (o_i, r_l, o_j)$ and $s_k = (o_i, r_s, o_j)$, where $r_l \in R_{loc}$ and $r_s \in R_{size}$. Here, $R_{size}$ specifies the relative size of a component, such as smaller or bigger, and $r_l$ can be left, right, above, below, upper-left, lower-left, etc. In addition, when anchoring on the canvas $o_0$, we extend $R_{loc}$ to capture locations relative to the canvas, e.g., upper-left of the canvas.

Furthermore, in reality, designers often do not specify all the constraints. This results in an input graph with missing edges. Figure 2 shows an example of a three-component design with only one specified constraint, "(A, above, B)", and several unknown relations "?".


To this end, we augment $R_{loc}$ and $R_{size}$ to include an additional unknown category, and represent graphs that contain unknown size or location relations as $G^p_{size}$ and $G^p_{loc}$, respectively, to indicate that they are partial graphs. In Section 3.2, we describe how to predict the unknown relations in the partial graphs.
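To make this representation concrete, the following is a minimal sketch (Python; `PartialGraph` and all other names are illustrative, not from the authors' code) of a partial constraint graph whose unspecified relations default to the augmented unknown category:

```python
# Illustrative sketch of a partial constraint graph; unspecified relations
# default to the augmented "unknown" category described above.
from dataclasses import dataclass, field
from itertools import permutations

LOC_RELATIONS = ["unknown", "left", "right", "above", "below",
                 "upper-left", "lower-left"]

@dataclass
class PartialGraph:
    components: list                               # index 0 is the canvas o_0
    loc_edges: dict = field(default_factory=dict)  # (i, j) -> relation name

    def complete_edges(self):
        """Every ordered pair of components; missing constraints are 'unknown'."""
        return {(i, j): self.loc_edges.get((i, j), "unknown")
                for i, j in permutations(range(len(self.components)), 2)}

# The three-component example of Figure 2: only "(A, above, B)" is specified.
g = PartialGraph(["canvas", "A", "B", "C"], loc_edges={(1, 2): "above"})
assert g.complete_edges()[(1, 3)] == "unknown"
```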

Finally, we denote the output layout of the neural design network as a set of bounding boxes $\{bb_1, \dots, bb_{|O|}\}$, where $bb_i = (x_i, y_i, w_i, h_i)$ represents the location and shape of component $o_i$.

In all modules, we apply graph convolution networks to the graphs. The GNN takes as input the features of nodes and edges, and outputs updated features. The input features can be one-hot vectors representing the categories or any embedded representations. More implementation details can be found in the supplementary material.
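As a concrete illustration of this interface, here is a minimal message-passing layer in the spirit described above; it is a simplified sketch, not the authors' implementation. It fuses (source node, edge, target node) feature triples and averages the resulting messages back onto the nodes:

```python
# A simplified graph convolution layer over node and edge features (sketch).
import torch
import torch.nn as nn

class EdgeGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Maps each (source, edge, target) feature triple to an updated triple.
        self.fuse = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())

    def forward(self, nodes, edges, edge_index):
        # nodes: (N, dim), edges: (M, dim), edge_index: (M, 2) rows of (src, dst)
        src, dst = edge_index[:, 0], edge_index[:, 1]
        triple = torch.cat([nodes[src], edges, nodes[dst]], dim=-1)
        h_src, h_edge, h_dst = self.fuse(triple).chunk(3, dim=-1)
        # Average the messages over all edges incident to each node.
        new_nodes = torch.zeros_like(nodes)
        count = torch.zeros(nodes.size(0), 1)
        for idx, h in ((src, h_src), (dst, h_dst)):
            new_nodes.index_add_(0, idx, h)
            count.index_add_(0, idx, torch.ones(len(idx), 1))
        return new_nodes / count.clamp(min=1), h_edge
```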

3.2. Relation Prediction

In this module, we aim to infer the unknown relations in the user-specified constraints. Figure 2 shows an example where a three-component graph is given and we need to predict the missing relations between A, B, and C. For brevity, we denote graphs with complete relations as $G$ and graphs with partial relations as $G^p$, which can be either $G^p_{size}$ or $G^p_{loc}$. Note that since the augmented relations include the unknown category, both $G^p$ and $G$ are complete graphs in practice. We also use $e_i$ to refer to either $l_i$ or $s_i$.

We model the prediction process as a paired graph-to-graph translation task from $G^p$ to $G$. The translation is multimodal: a graph with partial relations can be translated to many possible graphs with complete relations. Therefore, we adopt a framework similar to multimodal image-to-image translation [21, 34] and treat $G^p$ as the source domain and $G$ as the target domain. Similar to [34], the translation is a conditional generation process that maps the source graph, along with a latent code, to the target graph. The latent code is encoded from the corresponding target graph $G$ to achieve reconstruction during training, and is sampled from a prior during testing. The conditional translation encoding process is modeled as:

$$z = g_c(G), \quad z \in Z,$$
$$\{h_i\} = g_p(G^p, z), \quad i = 1, \dots, |E|,$$
$$\{\hat{e}_i\} = h_{pred}(\{h_i\}), \quad i = 1, \dots, |E|, \tag{1}$$

where $g_c$ and $g_p$ are graph convolution networks, and $h_{pred}$ is a relation predictor. In addition, $E$ is the set of edges in the target graph. Note that $|E^p| = |E|$, since both $G^p$ and $G$ are complete graphs.

The model is trained with a reconstruction loss $\mathcal{L}_{cls} = CE(\{\hat{e}_i\}, \{e_i\})$ on the relation categories, where $CE$ denotes the cross-entropy function, and with a KL loss on the encoded latent vectors to facilitate sampling at inference time: $\mathcal{L}^2_{KL} = \mathbb{E}[D_{KL}(q(z)\,\|\,\mathcal{N}(0,1))]$, where $D_{KL}(p\,\|\,q) = \int p(z)\log\frac{p(z)}{q(z)}\,dz$. The objective of the relation prediction module is:

$$\mathcal{L}_2 = \lambda_{cls}\mathcal{L}_{cls} + \lambda^2_{KL}\mathcal{L}^2_{KL}. \tag{2}$$

The reconstruction loss captures the knowledge that the predicted relations should agree with the existing relations in $G^p$, and fills in any missing edge with the most likely relation discovered in the training data.
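One training step of this module can be sketched as follows; `g_c`, `g_p`, and `h_pred` are hypothetical stand-ins for the networks in Eq. (1), assumed to return the shapes noted in the comments:

```python
# Sketch of the relation prediction loss (Eqs. 1-2); not the authors' code.
import torch
import torch.nn.functional as F

def relation_prediction_loss(g_c, g_p, h_pred, G_partial, G_full,
                             true_edge_labels, lambda_cls=1.0, lambda_kl=1.0):
    mu, logvar = g_c(G_full)                 # encode the complete graph
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
    edge_feats = g_p(G_partial, z)           # (|E|, dim) feature per edge
    logits = h_pred(edge_feats)              # (|E|, num relation categories)
    loss_cls = F.cross_entropy(logits, true_edge_labels)
    # KL(q(z|G) || N(0, I)) so z can be sampled from the prior at test time.
    loss_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return lambda_cls * loss_cls + lambda_kl * loss_kl
```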

3.3. Layout Generation

Given a graph with complete relations, this module aims to generate the design layout by predicting the bounding boxes of all nodes in the graph.

Let $G$ be the graph with complete relations constructed from $G_{size}$ and $G_{loc}$, the output of the relation prediction module. We model the generation process using a graph-based iterative conditional VAE model. We first obtain the features of each component by

$$\{f_i\}_{i=1}^{|O|} = g_{enc}(G), \tag{3}$$

where $g_{enc}$ is a graph convolution network. These features capture the relative relations among all components. We then predict bounding boxes in an iterative manner, starting from an empty canvas (i.e., all bounding boxes are unknown). As shown in Figure 2, the prediction of each bounding box is conditioned on the initial features as well as the current canvas, i.e., the predicted bounding boxes from previous iterations. At iteration $k$, the condition is modeled as:

$$t_k = (\{f_i\}_{i=1}^{|O|}, \{bb_i\}_{i=1}^{k-1}), \quad c_k = g_{update}(t_k), \tag{4}$$

where $g_{update}$ is another graph convolution network, $t_k$ is a tuple of the features and the current canvas at iteration $k$, and $c_k$ is a vector. We then apply a conditional VAE to the current bounding box $bb_k$ conditioned on $c_k$:

$$z = h^{enc}_{bb}(bb_k, c_k), \quad \hat{bb}_k = h^{dec}_{bb}(z, c_k), \tag{5}$$

where $h^{enc}_{bb}$ and $h^{dec}_{bb}$ are the encoder and decoder, each consisting of fully connected layers. We train the model with the conventional VAE losses: a reconstruction loss $\mathcal{L}_{recon} = \sum_{i=1}^{|O|} \|bb_i - \hat{bb}_i\|_1$ and a KL loss $\mathcal{L}^1_{KL} = \mathbb{E}[D_{KL}(q(z)\,\|\,p(z|c_k))]$. The objective of the layout generation module is:

$$\mathcal{L}_1 = \lambda_{recon}\mathcal{L}_{recon} + \lambda^1_{KL}\mathcal{L}^1_{KL}. \tag{6}$$

The model is trained with teacher forcing, where the ground-truth bounding box at step $k$ is used as the input for step $k+1$.


Datasets            |            Ads              |          Magazine           |            RICO
                    | FID ↓         Alignment ↓   | FID ↓         Alignment ↓   | FID ↓         Alignment ↓
sg2im-none          | 116.63        0.0063        | 95.81         0.0097        | 269.60        0.0014
LayoutVAE           | 138.11±38.91  0.0121±0.0008 | 81.56±36.78   0.0314±0.0011 | 192.11±29.97  0.0119±0.0039
NDN-none            | 129.68±32.12  0.0091±0.0007 | 69.43±32.92   0.0251±0.0009 | 143.51±22.36  0.0091±0.0032

sg2im               | 230.44        0.0069        | 102.35        0.0178        | 190.68        0.007
NDN-all             | 168.44±21.83  0.0061±0.0005 | 82.77±16.24   0.0151±0.0009 | 64.78±11.60   0.0032±0.0002

                    | Accuracy ↓    Alignment ↓   | Accuracy ↓    Alignment ↓   | Accuracy ↓    Alignment ↓
LayoutVAE-loo       | 0.071±0.002   0.0048±0.0001 | 0.059±0.002   0.0141±0.0002 | 0.045±0.0021  0.0039±0.0002
NDN-loo             | 0.043±0.001   0.0036±0.001  | 0.024±0.0002  0.0130±0.0001 | 0.018±0.002   0.0014±0.0001

real data           | -             0.0034        | -             0.0126        | -             0.0012

Table 1: Quantitative comparisons. We compare the proposed method to other works on three datasets under three settings: the no-constraint setting, where no prior constraint is provided (first group); the all-constraint setting, where all relations are provided (second group); and the leave-one-out setting, which predicts the bounding box of a component given the ground-truth bounding boxes of the other components (third group). The FID metric measures realism and diversity, the alignment metric measures the alignment among components, and the accuracy metric measures prediction accuracy in the leave-one-out setting.

At test time, the model uses the actual output boxes from previous steps. In addition, the latent vector $z$ is sampled from a conditional prior distribution $p(z|c_k)$, where $p$ is a prior encoder.
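Putting Eqs. (3)-(5) together, test-time decoding can be sketched as below; `g_enc`, `g_update`, `prior_enc`, and `bb_decoder` are hypothetical stand-ins for the graph networks and VAE heads described above:

```python
# Sketch of iterative layout decoding at test time; not the authors' code.
import torch

def generate_layout(G, g_enc, g_update, prior_enc, bb_decoder):
    feats = g_enc(G)                      # per-component features {f_i}, Eq. (3)
    boxes = []                            # bounding boxes predicted so far
    for _ in range(feats.size(0)):
        c_k = g_update(feats, boxes)      # condition on features + canvas, Eq. (4)
        mu, logvar = prior_enc(c_k)       # learned conditional prior p(z | c_k)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        boxes.append(bb_decoder(z, c_k))  # decode bb_k, Eq. (5)
    return torch.stack(boxes)             # (|O|, 4) rows of (x, y, w, h)
```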

Bounding boxes with predefined shapes. In many design use cases, it is often required to constrain some design components to a fixed size. For example, the logo size may need to be fixed in an ad design. To achieve this, we augment the original layout generation module with an additional VAE encoder $h^{enc}_{bb}$ to ensure the encoded latent vectors $z$ can be decoded to bounding boxes with the desired widths and heights. Similar to (5), given a ground-truth bounding box $bb_k = (x_k, y_k, w_k, h_k)$, we obtain the reconstructed bounding box $\hat{bb}_k = (\hat{x}_k, \hat{y}_k, \hat{w}_k, \hat{h}_k)$ with $h^{enc}_{bb}$ and $h^{dec}_{bb}$. Then, instead of applying the reconstruction loss to the whole bounding box tuple, we only enforce reconstruction of the width and height with

$$\mathcal{L}^{size}_{recon} = \sum_{i=1}^{|O|} \|w_i - \hat{w}_i\|_1 + \|h_i - \hat{h}_i\|_1. \tag{7}$$

The objective of the augmented layout generation module is given by:

$$\mathcal{L}'_1 = \lambda^{size}_{recon}\mathcal{L}^{size}_{recon} + \mathcal{L}_1. \tag{8}$$
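A minimal sketch of the size-only term in Eq. (7), assuming boxes are stored as (x, y, w, h) rows:

```python
# Sketch of Eq. (7): the L1 loss touches only widths and heights, leaving
# the predicted locations (x, y) unconstrained.
import torch

def size_recon_loss(pred_boxes, true_boxes):
    # pred_boxes, true_boxes: (|O|, 4) tensors laid out as (x, y, w, h)
    return (pred_boxes[:, 2:] - true_boxes[:, 2:]).abs().sum()
```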

3.4. Layout Refinement

We predict bounding boxes in an iterative manner, which requires fixing the predicted bounding boxes from previous iterations. As a result, the overall bounding boxes might not be optimal, as shown in the layout generation module in Figure 2. To tackle this issue, the final layout refinement module fine-tunes the bounding boxes for better alignment and visual quality. Given a graph $G$ with ground-truth bounding boxes $\{bb_i\}$, we simulate misalignment by randomly applying offsets $\delta_i \sim U(-0.05, 0.05)$ to $\{bb_i\}$, where $U$ is a uniform distribution, obtaining misaligned bounding boxes $\{\tilde{bb}_i\} = \{bb_i + \delta_i\}$. We apply a graph convolution network $g_{ft}$ for fine-tuning:

$$\{\hat{bb}_i\} = g_{ft}(G, \{\tilde{bb}_i\}). \tag{9}$$

The model is trained with the reconstruction loss $\mathcal{L}_{ft} = \sum_i \|bb_i - \hat{bb}_i\|_1$.
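One training step of the refinement module can be sketched as follows, with `g_ft` and the optimizer assumed to be defined elsewhere:

```python
# Sketch of refinement training: perturb ground-truth boxes with uniform
# noise and train g_ft to undo the misalignment (Eq. 9); not the authors' code.
import torch

def refinement_step(g_ft, G, gt_boxes, optimizer):
    noisy = gt_boxes + torch.empty_like(gt_boxes).uniform_(-0.05, 0.05)
    pred = g_ft(G, noisy)                  # Eq. (9)
    loss = (pred - gt_boxes).abs().sum()   # L1 reconstruction loss L_ft
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```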

4. Experiments and Analysis

Datasets. We perform the evaluation on three datasets:
• Magazine [32]. The dataset contains 4k images of magazine pages and 6 categories (texts, images, headlines, over-image texts, over-image headlines, backgrounds).
• RICO [3, 24]. The original dataset contains 91k images of Android app interfaces and 27 categories. We choose the 13 most frequent categories (toolbars, images, texts, icons, buttons, inputs, list items, advertisements, pager indicators, web views, background images, drawers, modals) and filter the number of components within an image to be fewer than 10, totaling 21k images.
• Image banner ads. We collect 500 image banner ads of size 300×250 via image search using keywords such as "car ads". We annotate bounding boxes of 6 categories: images, regions of interest, logos, brand names, texts, and buttons. Examples can be found in the supplementary material.


Unary size (%) | Binary size (%) | Unary location (%) | Binary location (%) | Refinement | FID ↓        | Alignment ↓
0              | 0               | 0                  | 0                   | ✓          | 143.51±22.36 | 0.0091±0.0032
20             | 20              | 0                  | 0                   | ✓          | 141.64±20.01 | 0.0087±0.0030
0              | 0               | 20                 | 20                  | ✓          | 129.92±23.76 | 0.0081±0.003
20             | 20              | 20                 | 20                  |            | 126.18±23.11 | 0.0074±0.0022
20             | 20              | 20                 | 20                  | ✓          | 125.41±21.68 | 0.0070±0.002
100            | 100             | 100                | 100                 |            | 70.55±12.68  | 0.0036±0.002
100            | 100             | 100                | 100                 | ✓          | 64.78±11.60  | 0.0032±0.0002

Table 2: Ablation on partial constraints. We measure the FID and alignment of the proposed method given different percentages of prior constraints as inputs on the RICO dataset. We also show that the refinement module further improves the visual quality as well as the alignment.


Implementation details. In this work, $h^{enc}_{bb}$, $h^{dec}_{bb}$, and $h_{pred}$ each consist of 3 fully connected layers. In addition, $g_c$, $g_p$, $g_{enc}$, and $g_{update}$ consist of 3 graph convolution layers. The dimension of the latent vectors $z$ in the relation prediction and layout generation modules is 32. For training, we use the Adam optimizer [17] with a batch size of 512, a learning rate of 0.0001, and $(\beta_1, \beta_2) = (0.5, 0.999)$. In all experiments, we set the hyper-parameters as follows: $\lambda_{cls} = 1$, $\lambda^1_{KL} = 0.005$, $\lambda_{recon} = \lambda^2_{KL} = 1$, and $\lambda^{size}_{recon} = 10$. We use a predefined order of component sets in all experiments. More implementation details can be found in the supplementary material.
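For reference, the stated training configuration condenses to the following sketch; `model` is a hypothetical stand-in for the full network, and the grouping of the λ weights follows our reading of the values above:

```python
# Training configuration as stated in this section (sketch only).
import torch

LAMBDAS = {"cls": 1.0, "kl_1": 0.005, "recon": 1.0, "kl_2": 1.0,
           "size_recon": 10.0}   # loss weights as listed above
LATENT_DIM = 32                  # latent z dimension in both CVAE modules
BATCH_SIZE = 512

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with the learning rate and betas stated above.
    return torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
```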

Evaluated methods. We perform the evaluation on the following algorithms:
• sg2im [14]. The model is proposed to generate a natural scene layout from a given scene graph. The sg2im method takes as input graphs with complete relations in the setting where all constraints are provided. When we compare with this method in the setting where no constraint is given, we simplify the input scene graph by removing all relations. We refer to the simplified model as sg2im-none.
• LayoutVAE [15]. This model takes a label set as input, and predicts the number of components for each label as well as the locations of each component. We compare with the second stage of the LayoutVAE model (i.e., the bounding box prediction stage) by giving the number of components for each label. In addition, we refer to LayoutVAE-loo as the model that predicts the bounding box of a single component when all other components are provided and fixed (the leave-one-out setting).
• Neural design network. We refer to NDN-none when the input contains no prior constraint, NDN-all in the same setting as sg2im where all constraints are provided, and NDN-loo in the same setting as LayoutVAE-loo.

4.1. Quantitative Evaluation

Realism and accuracy. We evaluate visual quality via the Fréchet Inception Distance (FID) [9] by measuring how close the distribution of generated layouts is to that of real ones. Since there is no generally applicable feature extractor for arbitrary graphs, similar to [22], we train a binary layout classifier to discriminate between good and bad layouts, where bad layouts are generated by randomly moving component locations in good layouts. The classifier consists of four graph convolution layers and three fully connected layers. We extract features from the second-to-last fully connected layer to measure FID.
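Concretely, FID over these classifier features reduces to the standard Fréchet distance between two Gaussians fitted to the feature sets [9]; a sketch follows, with feature extraction assumed to be done by the layout classifier described above:

```python
# Fréchet distance between Gaussians fitted to real and generated features.
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    # real_feats, fake_feats: (N, d) arrays of penultimate-layer features
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real   # matrix square root
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(cov_r + cov_f - 2.0 * covmean))
```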

We measure FID in two settings. First, a model predicts bounding boxes without any constraints; that is, only the number and categories of components are provided. We compare with LayoutVAE and sg2im-none in this setting. Second, a model predicts bounding boxes with all constraints provided. We compare with sg2im in this setting, since LayoutVAE cannot take constraints as inputs. The first two groups of rows in Table 1 present the results of these two settings. Since LayoutVAE and the proposed method are both stochastic models, we generate 100 samples for each testing design in each trial, and average the results over 5 trials. In both the no-constraint and all-constraint settings, the proposed method performs favorably against the other schemes.

We also measure prediction accuracy in the leave-one-out setting, i.e., predicting the bounding box of a component when the bounding boxes of all other components are provided. The third group of rows in Table 1 shows the comparison to the LayoutVAE-loo method in this setting. The proposed method attains better accuracy with statistical significance (≥ 95%), indicating that the graph-based framework encodes better relations among components.


Figure 3: Qualitative comparison. We evaluate the proposed method against the LayoutVAE and sg2im methods in both the no-constraint and all-constraint settings. The proposed method better models the relations among components and generates layouts of better visual quality.


Alignment. Alignment is an important principle in design creation. In most good designs, components are either center-aligned or edge-aligned (e.g., left- or right-aligned). Therefore, in addition to realism, we explicitly measure the alignment among components using:

$$\frac{1}{N_D} \sum_d \sum_i \min_{j,\, i \neq j} \left\{ \min\big( l(c^d_i, c^d_j),\; m(c^d_i, c^d_j),\; r(c^d_i, c^d_j) \big) \right\}, \tag{10}$$

where $N_D$ is the number of generated layouts and $c^d_k$ is the $k$th component of the $d$th layout. In addition, $l$, $m$, and $r$ are alignment functions that measure the distances between the left, center, and right edges of components, respectively.
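A direct sketch of Eq. (10), assuming boxes are (x, y, w, h) arrays with x the left edge:

```python
# Alignment metric of Eq. (10): for each component, take the smallest
# left/center/right distance to any other component, sum, and divide by N_D.
import numpy as np

def alignment_score(layouts):
    # layouts: list of (N_d, 4) arrays with rows (x, y, w, h)
    total = 0.0
    for boxes in layouts:
        if len(boxes) < 2:
            continue
        left = boxes[:, 0]
        center = boxes[:, 0] + boxes[:, 2] / 2.0
        right = boxes[:, 0] + boxes[:, 2]
        for i in range(len(boxes)):
            total += min(
                min(abs(e[i] - e[j]) for j in range(len(boxes)) if j != i)
                for e in (left, center, right))
    return total / len(layouts)   # 1 / N_D
```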

Table 1 presents the results in the no-constraint, all-constraint, and leave-one-out settings, also averaged over 5 trials. The proposed method performs favorably against the other methods. The sg2im-none method obtains a better alignment score because it tends to predict bounding boxes at several fixed locations when no prior constraint is provided, which leads to a worse FID. For similar reasons, the sg2im method gains a slightly higher alignment score on RICO.

Partial constraints. The previous experiments are conducted under settings where either no constraints or all constraints are provided. We now demonstrate the efficacy of the proposed method in handling partial constraints. Table 2 shows the results of layout prediction with different percentages of prior constraints provided. We evaluate the partial-constraint setting on the RICO dataset, which is the most difficult dataset in terms of diversity and complexity. Ideally, the FID and alignment scores should be similar regardless of the percentage of constraints given. However, on the challenging RICO dataset, prior information about size and location still greatly improves the visual quality, as shown in Table 2. The location constraints contribute more improvement, since they explicitly provide guidance from the ground-truth layouts. As for the alignment score, layouts in all settings perform similarly. Furthermore, the refinement module slightly improves the alignment score as well as the FID.

4.2. Qualitative Evaluation

We compare the proposed method with related work in Figure 3. In the all-constraint setting, both the sg2im method and the proposed model can generate reasonable layouts similar to the ground-truth layouts. However, the proposed model better handles alignment and overlapping issues. In the no-constraint setting, the sg2im-none method tends to place components of the same category at the same location, like the "text"s in the second row and the "text"s and "text button"s in the third row. The LayoutVAE method, on the other hand, cannot handle relations among components well without using graphs. The proposed method can generate layouts with good visual quality, even with no constraint provided.


Figure 4: Layout generation with partial user-specified constraints. We generate layouts according to different user-specified constraints on location and size. Furthermore, we construct designs with real assets based on the generated layouts to better visualize the quality of our model.


Partial constraints. In Figure 4, we present the results of layout generation given several randomly selected constraints on size and location. Our model generates design layouts that are both realistic and follow the user-specified constraints. To better visualize the quality of the generated layouts, we present designs with real assets created from the predicted layouts. Furthermore, using the augmented layout generation module, we can constrain the size of specific components to desired shapes (e.g., we fix the image and logo sizes in the first row of Figure 4).

Layout recommendation. The proposed model can also help designers decide the best location for a specific design component (e.g., a logo, button, or headline) when a partial design layout is provided. This can be done by building graphs with partial location and size relations based on the current canvas, and setting the relations involving the target components to unknown. We then complete this graph using the relation prediction module. Finally, conditioned on the predicted graph as well as the current canvas, we perform iterative bounding box prediction with the layout generation module. Figure 5 shows examples of layout recommendations.

5. Conclusion

In this work, we propose a neural design network to handle design layout generation given user-specified constraints.


Figure 5: Layout recommendation. We show examples of layout recommendations, where locations of target components are recommended given the current layout.

The proposed method generates layouts that are visually appealing and follow the constraints via a three-module framework comprising a relation prediction module, a layout generation module, and a refinement module. Extensive quantitative and qualitative experiments demonstrate the efficacy of the proposed model. We also present examples of constructing real designs based on generated layouts, and an application to layout recommendation.


References

[1] Zoya Bylinskii, Nam Wook Kim, Peter O'Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, and Aaron Hertzmann. Learning visual importance for graphic designs and data visualizations. In UIST, 2017.
[2] Niranjan Damera-Venkata, Jose Bento, and Eamonn O'Brien-Strain. Probabilistic document model for automated document composition. In DocEng, 2011.
[3] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In UIST, 2017.
[4] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
[5] Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In ICNN, 1996.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[7] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In IJCNN, 2005.
[8] Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. Imagine this! Scripts to compositions to videos. In ECCV, 2018.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
[10] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018.
[11] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In DocEng, 2009.
[12] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In CVPR, 2016.
[13] Wengong Jin, Kevin Yang, Regina Barzilay, and Tommi Jaakkola. Learning multimodal graph-to-graph translation for molecular optimization. In ICLR, 2019.
[14] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
[15] Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, and Greg Mori. LayoutVAE: Stochastic scene layout generation from a label set. In ICCV, 2019.
[16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[18] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[19] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[20] Ranjitha Kumar, Jerry O Talton, Salman Ahmad, and Scott R Klemmer. Bricolage: Example-based retargeting for web design. In SIGCHI, 2011.
[21] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In ECCV, 2018.
[22] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. In NeurIPS, 2019.
[23] Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and Tingfa Xu. LayoutGAN: Generating graphic layouts with wireframe discriminators. In ICLR, 2019.
[24] Thomas F. Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. Learning design semantics for mobile apps. In UIST, 2018.
[25] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. TVCG, 2014.
[26] Xufang Pang, Ying Cao, Rynson W.H. Lau, and Antoni B Chan. Directing user attention via visual flow on web designs. ACM TOG (Proc. SIGGRAPH), 2016.
[27] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
[28] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[29] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. TNN, 2008.
[30] Sou Tabata, Hiroki Yoshihara, Haruka Maeda, and Kei Yokoyama. Automatic layout generation for graphical design magazines. In SIGGRAPH, 2019.
[31] Fuwen Tan, Song Feng, and Vicente Ordonez. Text2Scene: Generating abstract scenes from textual descriptions. In CVPR, 2019.
[32] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson W.H. Lau. Content-aware generative modeling of graphic design layouts. SIGGRAPH, 2019.
[33] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph generation. In ECCV, 2018.
[34] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In NIPS, 2017.