
On the “steerability” of generative adversarial networks

Ali Jahanian*, Lucy Chai*, Phillip Isola
MIT CSAIL

{jahanian,lrchai,phillipi}@mit.edu

[Figure 1 panels: - Zoom +, - Shift Y +, - Shift X +, - Brightness +, - Rotate2D +, - Rotate3D +]

Figure 1: Learned latent space trajectories in generative adversarial networks correspond to visual transformations like camera shift and zoom. These transformations can change the distribution of generated data, but only so much – biases in the data, like centered objects, reveal themselves as objects get “stuck” at the image borders when we try to shift them out of frame. Take the “steering wheel”, drive in the latent space, and explore the natural image manifold via generative transformations!

Abstract

An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to training on biased data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise – these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by “steering” in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution, and conduct experiments that demonstrate this. Code is released on our project page: https://ali-design.github.io/gan_steerability/

1. Introduction

The quality of deep generative models has increased dramatically over the past few years. When introduced in 2014, Generative Adversarial Networks (GANs) could only synthesize MNIST digits and low-resolution grayscale faces [10]. The most recent models, on the other hand, produce diverse and high-resolution images that are often indistinguishable from natural photos [4, 13].

Science fiction has long dreamed of virtual realities filled with synthetic content as rich as, or richer than, the real world (e.g., The Matrix, Ready Player One). How close are we to this dream? Traditional computer graphics can render photorealistic 3D scenes, but has no way to automatically generate detailed content. Generative models like GANs, in contrast, can hallucinate content from scratch, but we do not currently have tools for navigating the generated scenes in the same kind of way as you can walk through and interact with a 3D game engine.

In this paper, we explore the degree to which you can navigate the visual world of a GAN. Figure 1 illustrates the kinds of transformations we explore. Consider the dog at the top-left. By moving in some direction of GAN latent space, can we hallucinate walking toward this dog? As the figure indicates, and as we will show in this paper, the answer is yes.


However, as we continue to zoom in, we quickly reach limits. The direction in latent space that zooms toward the dog can only make the dog so big. Once the dog face fills the full frame, continuing to walk in this direction fails to increase the zoom. A similar effect occurs in the daisy example (row 2 of Fig. 1): we can find a direction in latent space that shifts the daisy up or down, but the daisy gets “stuck” at the edge of the frame – continuing to walk in this direction does not shift the daisy out of the frame.

We hypothesize that these limits are due to biases in the distribution of images on which the GAN is trained. For example, if dogs and daisies tend to be centered and in frame in the dataset, the same may be the case in the GAN-generated images as well. Nonetheless, we find that some degree of transformation is possible: we can still shift the generated daisies up and down in the frame. When and why can we achieve certain transformations but not others?

This paper seeks to quantify the degree to which we can achieve basic visual transformations by navigating in GAN latent space. In other words, are GANs “steerable” in latent space?¹ We further analyze the relationship between the data distribution on which the model is trained and the success in achieving these transformations. We find that it is possible to shift the distribution of generated images to some degree, but cannot extrapolate entirely out of the support of the training data. In particular, we find that attributes can be shifted in proportion to the variability of that attribute in the training data.

One of the current criticisms of generative models is that they simply interpolate between datapoints and fail to generate anything truly new. Our results add nuance to this story. It is possible to achieve distributional shift, where the model generates realistic images from a different distribution than that on which it was trained. However, the ability to steer the model in this way depends on the data distribution having sufficient diversity along the dimension we are steering.

Our main contributions are:

• We show that a simple walk in the z space of GANs can achieve camera motion and color transformations in the output space, in a self-supervised manner, without labeled attributes or distinct source and target images.

• We show that this linear walk is as effective as more complex non-linear walks, suggesting that GAN models learn to linearize these operations without being explicitly trained to do so.

• We design measures to quantify the extent and limits of the transformations generative models can achieve, and show experimental evidence. We further show the relation between the shiftability of the model distribution and the variability in datasets.

¹We use the term “steerable” in analogy to the classic steerable filters of Freeman & Adelson [7].

• We demonstrate these transformations as a general-purpose framework on different model architectures, e.g., BigGAN and StyleGAN, illustrating different disentanglement properties in their respective latent spaces.

2. Related work

Walking in the latent space can be seen from different perspectives: how to achieve it, what limits it, and what it enables us to do. Our work addresses these three aspects together, and we briefly refer to each one in prior work.

Interpolations in GAN latent space. Traditional approaches to image editing in latent space involve finding linear directions in GAN latent space that correspond to changes in labeled attributes, such as smile-vectors and gender-vectors for faces [20, 13]. However, smoothly varying latent space interpolations are not exclusive to GANs: in flow-based generative models, linear interpolations between encoded images allow one to edit a source image toward attributes contained in a separate target [16]. With Deep Feature Interpolation [24], one can remove the generative model entirely; instead, the interpolation operation is performed on an intermediate feature space of a pretrained classifier, again using the feature mappings of source and target sets to determine an edit direction. Unlike these approaches, we learn our latent-space trajectories in a self-supervised manner, without relying on labeled attributes or distinct source and target images. We measure the edit distance in pixel space rather than latent space, because our desired target image may not directly exist in the latent space of the generative model. Despite allowing for non-linear trajectories, we find that linear trajectories in the latent space model simple image manipulations – e.g., zoom-vectors and shift-vectors – surprisingly well, even when models were not explicitly trained to exhibit this property in latent space.

Dataset bias. Biases from training data and network architecture both factor into the generalization capacity of learned models [23, 8, 1]. Dataset biases partly come from human preferences in taking photos: we typically capture images in specific “canonical” views that are not fully representative of the entire visual world [19, 12]. When models are trained to fit these datasets, they inherit the biases in the data. Such biases may result in models that misrepresent the given task – such as tendencies towards texture bias rather than shape bias in ImageNet classifiers [8] – which in turn limits their generalization performance on similar objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.

Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob – the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.

Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory, and therefore our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.

3. Method

Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here, z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, where transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).

[Figure 2 diagram: latent space Z with a linear walk z + αw or a non-linear walk f(f(···f(z))), and image space X with the generated image G(z), the transformed image G(z + αw), and the target image edit(G(z), α).]

Figure 2: We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(···f(z)))) when we choose a non-linear walk.

Objective. How can we maneuver in the latent space to discover different transformations on a given image? We desire to learn a vector representing the optimal path for this latent space transformation, which we parametrize as an N-dimensional learned vector. We weight this vector by a continuous parameter α, which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function:

w* = argmin_w  E_{z,α} [ L( G(z + αw), edit(G(z), α) ) ]    (1)

Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner – we perform our desired transformation on an arbitrary generated image, and subsequently we optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote such a transformation in the direction of w*, controlled with the variable step size α, defined as model(α) = G(z + αw*).
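
As a concrete illustration, the following is a minimal sketch of this optimization, assuming a PyTorch-style differentiable generator G and a precomputed or differentiable edit function; class conditioning, the masking of out-of-frame pixels, and other training details from the paper are omitted, and all names are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def learn_linear_walk(G, edit, dim_z, steps=20000, batch=8, lr=1e-3, device="cuda"):
    """Sketch of Eq. 1: learn a single direction w so that G(z + alpha * w)
    matches the pixel-space edit of G(z) by the same alpha."""
    for p in G.parameters():            # only the walk vector is optimized
        p.requires_grad_(False)
    w = torch.zeros(1, dim_z, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, dim_z, device=device)                   # latent samples
        alpha = torch.empty(batch, 1, device=device).uniform_(-1, 1)   # step sizes
        with torch.no_grad():
            target = edit(G(z), alpha)                  # target: edit(G(z), alpha)
        loss = F.mse_loss(G(z + alpha * w), target)     # L2 objective of Eq. 1
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```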

The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize

L = E_{z,n} [ || G(f^n(z)) − edit(G(z), nε) || ]    (2)

where f^n(·) is an n-th-order function composition f(f(f(...))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.

[Figure 3 panels: - Rotate 2D +, - Shift Y +, - Zoom +, with average LPIPS perceptual distances annotated below each panel.]

Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.

We use this function composition approach rather than the simpler setup of G(z + αNN(z)) because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).

Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., “luminance”, in the dataset versus in images generated after walking in latent space.

For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
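
A small sketch of this sampling procedure, assuming images are given as H×W×3 uint8 arrays (function and variable names are ours, not from the released code):

```python
import numpy as np

def sample_color_stats(images, n_pixels=100, seed=0):
    """Sample n_pixels random pixels per image and return per-channel values
    and luminance (mean RGB), normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    channels, luminance = [], []
    for img in images:
        h, w, _ = img.shape
        ys = rng.integers(0, h, n_pixels)
        xs = rng.integers(0, w, n_pixels)
        px = img[ys, xs].astype(np.float32) / 255.0    # n_pixels x 3 values in [0, 1]
        channels.append(px)
        luminance.append(px.mean(axis=1))              # mean over R, G, B
    return np.concatenate(channels), np.concatenate(luminance)
```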

For zoom and shift transformations, we rely on an object detector, which captures the central object of the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest-probability bounding box corresponding to the desired class and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box, and normalize by image width or height.
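
The statistic extraction might look as follows, assuming a hypothetical detect(image) wrapper around the detector that returns (class_id, score, box) tuples with box coordinates normalized to [0, 1]; the wrapper itself is not shown and is an assumption on our part:

```python
def box_statistics(image, target_class, detect):
    """Compute the zoom and shift statistics from a detection. `detect(image)` is a
    hypothetical wrapper around the object detector returning a list of
    (class_id, score, (x0, y0, x1, y1)) tuples, coordinates normalized to [0, 1]."""
    candidates = [(score, box) for cls, score, box in detect(image) if cls == target_class]
    if not candidates:
        return None                        # no successful detection for this class
    _, (x0, y0, x1, y1) = max(candidates)  # highest-probability box for the class
    area = (x1 - x0) * (y1 - y0)           # zoom statistic: box area / image area
    center_x = 0.5 * (x0 + x1)             # shift X statistic
    center_y = 0.5 * (y0 + y1)             # shift Y statistic
    return area, center_x, center_y
```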

Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.

4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.

4.1 What image transformations can we achieve by steering in the latent space?

We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.

We further ask the question: how much are we able to achieve each transformation? When we increase the step size α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point; in fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations of these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.

An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars to blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.

[Figure 4 panels: - Redness + (jellyfish, goldfinch), - Darkness + (volcano, alp), - Blueness + (sports car, fire engine), each with LPIPS perceptual distance boxplots.]

Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., “Volcanoes” explode, “Alps” do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.

Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.

Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic looking images.

For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.

[Figure 5 panels, per column (Luminance, Shift X, Shift Y, Zoom): attribute distributions (PDF) under different α values, intersection with G(z) as a function of α, and FID as a function of α.]

Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.

Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
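
A minimal sketch of one way to compute such an intersection measure, using normalized histograms of the attribute values under G(z) and under model(α) (the binning choices are ours and are only illustrative):

```python
import numpy as np

def distribution_intersection(values_a, values_b, bins=50, value_range=(0.0, 1.0)):
    """Approximate the overlap area between two attribute distributions, e.g.,
    object center X under G(z) versus under model(alpha), via normalized histograms."""
    hist_a, edges = np.histogram(values_a, bins=bins, range=value_range, density=True)
    hist_b, _ = np.histogram(values_b, bins=bins, range=value_range, density=True)
    bin_width = edges[1] - edges[0]
    return float(np.sum(np.minimum(hist_a, hist_b)) * bin_width)  # overlap area in [0, 1]
```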

We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success with which we are able to achieve the zoom-in operation under the constraint of realistic looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.

4.2 How does the data affect the transformations?

How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?

One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively).

[Figure 6 panels: scatterplots of dataset standard deviation versus model ∆µ for Zoom (r = 0.37, p < 0.001), Shift Y (r = 0.39, p < 0.001), Shift X (r = 0.28, p < 0.001), and Luminance (r = 0.59, p < 0.001), with example zoom distributions P(A) for the robin and laptop classes.]

Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).

We hypothesize that if the amount of transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as

∆µ_k = || µ_{k, model(+α*)} − µ_{k, model(−α*)} ||_2    (3)

for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means of µ_{k, model(+α*)} and µ_{k, model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments we use T = 22).
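
For a scalar attribute, Eq. 3 and the per-class correlation analysis reduce to a few lines; the sketch below assumes per-class lists of attribute samples and uses SciPy's Pearson correlation, with names chosen for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

def per_class_trend(stats_pos, stats_neg, stats_data):
    """Eq. 3 for a scalar attribute: stats_pos[k] / stats_neg[k] hold per-class
    samples under model(+alpha*) / model(-alpha*), and stats_data[k] holds dataset
    samples. Returns the Pearson correlation between dataset sigma and model delta-mu."""
    classes = sorted(stats_data)
    delta_mu = [abs(np.mean(stats_pos[k]) - np.mean(stats_neg[k])) for k in classes]
    sigma = [np.std(stats_data[k]) for k in classes]
    r, p = pearsonr(sigma, delta_mu)   # e.g., r = 0.37, p < 0.001 for zoom (Fig. 6)
    return r, p
```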

We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).

For the zoom transformation, we show examples of two extremes along the trend. We show an example of the “robin” class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the “laptop” class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.

From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset.

[Figure 7 panels: - Zoom + over 0.125x–8x, for Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS walks.]

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.

Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

4.3 Alternative architectures and walks

We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model ∆µ for linear walks and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.

             Luminance   Shift X   Shift Y   Zoom
Linear          0.59       0.28      0.39    0.37
Non-linear      0.49       0.49      0.55    0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; p-value for slope < 0.001 for all transformations.

To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. Since [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them.

[Figure 8 panels: luminance distributions (PDF) under different α values for the StyleGAN cars generator, and qualitative examples for - Luminance +, - Red +, - Green +, and - Blue + transformations.]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.

This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to “steer” the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.

[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.

[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.

[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.

[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.

[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.

[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.

[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.

[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.

[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.

[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.

[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.

[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.

[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.

[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.

[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1 Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.

A.2 Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g w) and the target edit edit(G(z), α_t). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g w = z + α_R w_R + α_G w_G + α_B w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degrees of rotation. Using R = 45, we scale α_g = α_t/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degrees of rotation, we scale α_g = α_t/R where R = 45, and we apply a mask during training.
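
For reference, the sketch below illustrates the kind of pixel-space edit and mask described above for the shift, zoom, and color operations, written with NumPy and PIL on a single H×W×3 float image in [0, 1]; the parameterization is simplified relative to the α_t/α_g scaling in the text, and the rotation edits are omitted.

```python
import numpy as np
from PIL import Image

def edit_shift_x(img, t):
    """Shift the image right by t pixels (t may be negative). Returns the target
    and a mask that is False where the source content is unknown, so the loss
    can be restricted to the visible region."""
    out = np.zeros_like(img)
    mask = np.zeros(img.shape[:2], dtype=bool)
    if t >= 0:
        out[:, t:] = img[:, :img.shape[1] - t]
        mask[:, t:] = True
    else:
        out[:, :t] = img[:, -t:]
        mask[:, :t] = True
    return out, mask

def edit_zoom(img, factor):
    """Zoom by `factor` (> 1 crops the center and resizes up; < 1 shrinks and pads)."""
    h, w, _ = img.shape
    pil = Image.fromarray((img * 255).astype(np.uint8))
    if factor >= 1:                                  # zoom in: center crop, then resize
        ch, cw = int(round(h / factor)), int(round(w / factor))
        top, left = (h - ch) // 2, (w - cw) // 2
        pil = pil.crop((left, top, left + cw, top + ch)).resize((w, h))
        mask = np.ones((h, w), dtype=bool)
    else:                                            # zoom out: shrink, then pad
        ch, cw = int(round(h * factor)), int(round(w * factor))
        top, left = (h - ch) // 2, (w - cw) // 2
        canvas = Image.new("RGB", (w, h))
        canvas.paste(pil.resize((cw, ch)), (left, top))
        pil = canvas
        mask = np.zeros((h, w), dtype=bool)
        mask[top:top + ch, left:left + cw] = True
    return np.asarray(pil).astype(np.float32) / 255.0, mask

def edit_color(img, alpha_rgb):
    """Add a per-channel offset (aR, aG, aB) to the image, clipped to [0, 1]."""
    return np.clip(img + np.asarray(alpha_rgb, dtype=np.float32), 0.0, 1.0)
```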

A.3 Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z [ L(G(z + αw), edit(G(z), α)) ], where α is an arbitrary scalar value. Note that, for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z_1 = z + αF(z)
z_2 = z_1 + αF(z_1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z_1 + αF(z_1)    (4)

Therefore,

z + 2αF(z) = z + αF(z) + αF(z_1)  ⇒  αF(z) = αF(z_1)  ⇒  F(z) = F(z_1)

Thus, F(·) simply becomes a linear trajectory that is independent of the input z.

A.4 Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε dz/dt |_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
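
A minimal sketch of such a fixed-ε walk, assuming a two-layer network per direction and a renormalization that keeps the latent magnitude roughly constant (the hidden size and the exact normalization are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class EpsilonStep(nn.Module):
    """Sketch of the A.4 walk: a small two-layer network maps z to a fixed
    epsilon-step z' = renormalize(z + NN(z)); larger edits apply it recursively.
    One such network is used per transformation direction (+ and -)."""
    def __init__(self, dim_z, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim_z))

    def forward(self, z):
        z_next = z + self.net(z)
        # keep the latent magnitude roughly constant after the step
        scale = z.norm(dim=1, keepdim=True) / z_next.norm(dim=1, keepdim=True)
        return z_next * scale

def n_steps(f, z, n):
    """n-fold composition f(f(...f(z))), i.e., n epsilon-steps in latent space."""
    for _ in range(n):
        z = f(z)
    return z
```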

B. Additional Experiments

B.1 Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2 Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect, and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
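
This measurement can be sketched with the lpips package, assuming image batches are given as NCHW tensors scaled to [−1, 1]:

```python
import torch
import lpips  # https://github.com/richzhang/PerceptualSimilarity

def stepwise_perceptual_distance(images_by_alpha):
    """Mean LPIPS distance between image batches generated at consecutive alpha
    values; small values indicate that the transformation has saturated."""
    loss_fn = lpips.LPIPS(net="alex")
    distances = []
    with torch.no_grad():
        for prev, curr in zip(images_by_alpha[:-1], images_by_alpha[1:]):
            distances.append(loss_fn(prev, curr).mean().item())
    return distances
```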

B.3 Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4 Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z [ L(G(z + αw), edit(G(z), α)) ]    (5)

using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2 Non-linear Walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5 Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6 Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).

[Figure 9 panels: PDFs of Luminance (pixel intensity), Shift X (center X), Shift Y (center Y), and Zoom (area), for data versus model.]

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.

[Figure 10 panels: perceptual distance versus α for Luminance, Rotate 2D, Shift X, Shift Y, Zoom, and Rotate 3D.]

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.

Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes, using model-generated images under zoom, horizontal shift, and vertical shift transformations with random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

[Figure 15 panels, per column (Luminance, Shift X, Shift Y, Zoom): intersection with G(z) versus α, dataset σ versus model ∆µ scatterplots, attribute distributions (PDF), and FID versus α.]

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 2: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

we quickly reach limits. The direction in latent space that zooms toward the dog can only make the dog so big. Once the dog face fills the full frame, continuing to walk in this direction fails to increase the zoom. A similar effect occurs in the daisy example (row 2 of Fig. 1): we can find a direction in latent space that shifts the daisy up or down, but the daisy gets “stuck” at the edge of the frame – continuing to walk in this direction does not shift the daisy out of the frame.

We hypothesize that these limits are due to biases in the distribution of images on which the GAN is trained. For example, if dogs and daisies tend to be centered and in frame in the dataset, the same may be the case in the GAN-generated images as well. Nonetheless, we find that some degree of transformation is possible: we can still shift the generated daisies up and down in the frame. When and why can we achieve certain transformations but not others?

This paper seeks to quantify the degree to which we can achieve basic visual transformations by navigating in GAN latent space. In other words, are GANs “steerable” in latent space?1 We further analyze the relationship between the data distribution on which the model is trained and the success in achieving these transformations. We find that it is possible to shift the distribution of generated images to some degree, but we cannot extrapolate entirely out of the support of the training data. In particular, we find that attributes can be shifted in proportion to the variability of that attribute in the training data.

One of the current criticisms of generative models is that they simply interpolate between datapoints and fail to generate anything truly new. Our results add nuance to this story: it is possible to achieve distributional shift, where the model generates realistic images from a different distribution than that on which it was trained. However, the ability to steer the model in this way depends on the data distribution having sufficient diversity along the dimension we are steering.

Our main contributions are:

• We show that a simple walk in the z space of GANs can achieve camera motion and color transformations in the output space, in a self-supervised manner, without labeled attributes or distinct source and target images.

• We show that this linear walk is as effective as more complex non-linear walks, suggesting that GAN models learn to linearize these operations without being explicitly trained to do so.

• We design measures to quantify the extent and limits of the transformations generative models can achieve and show experimental evidence. We further show the relation between the shiftability of the model distribution and the variability in the dataset.

• We demonstrate these transformations as a general-purpose framework on different model architectures, e.g., BigGAN and StyleGAN, illustrating different disentanglement properties in their respective latent spaces.

1 We use the term “steerable” in analogy to the classic steerable filters of Freeman & Adelson [7].

2. Related work

Walking in the latent space can be seen from different perspectives: how to achieve it, what limits it, and what it enables us to do. Our work addresses these three aspects together, and we briefly refer to each one in prior work.

Interpolations in GAN latent space. Traditional approaches to image editing in latent space involve finding linear directions in GAN latent space that correspond to changes in labeled attributes, such as smile-vectors and gender-vectors for faces [20, 13]. However, smoothly varying latent space interpolations are not exclusive to GANs: in flow-based generative models, linear interpolations between encoded images allow one to edit a source image toward attributes contained in a separate target [16]. With Deep Feature Interpolation [24], one can remove the generative model entirely; instead, the interpolation operation is performed on an intermediate feature space of a pretrained classifier, again using the feature mappings of source and target sets to determine an edit direction. Unlike these approaches, we learn our latent-space trajectories in a self-supervised manner, without relying on labeled attributes or distinct source and target images. We measure the edit distance in pixel space rather than latent space, because our desired target image may not directly exist in the latent space of the generative model. Despite allowing for non-linear trajectories, we find that linear trajectories in the latent space model simple image manipulations – e.g., zoom-vectors and shift-vectors – surprisingly well, even when models were not explicitly trained to exhibit this property in latent space.

Dataset bias. Biases from training data and network architecture both factor into the generalization capacity of learned models [23, 8, 1]. Dataset biases partly come from human preferences in taking photos: we typically capture images in specific “canonical” views that are not fully representative of the entire visual world [19, 12]. When models are trained to fit these datasets, they inherit the biases in the data. Such biases may result in models that misrepresent the given task – such as tendencies towards texture bias rather than shape bias in ImageNet classifiers [8] – which in turn limits their generalization performance on similar objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.

Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob – the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.

Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory, and therefore our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.

3. Method

Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, where transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).

[Figure 2 diagram: a latent code z in latent space Z follows either a linear walk z + αw or a non-linear walk f(f(· · · f(z))) toward an optimal path; the generator maps these to the generated image G(z), the target image edit(G(z), α), and the output G(z + αw) in image space X]

Figure 2: We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(· · · f(z)))) when we choose a non-linear walk.

Objective. How can we maneuver in the latent space to discover different transformations on a given image? We desire to learn a vector representing the optimal path for this latent space transformation, which we parametrize as an N-dimensional learned vector. We weight this vector by a continuous parameter α which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function

w* = arg min_w  E_{z,α} [ L( G(z + αw), edit(G(z), α) ) ]        (1)

Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use the L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner – we perform our desired transformation on an arbitrary generated image, and subsequently we optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote such a transformation in the direction of w*, controlled with the variable step size α, defined as model(α) = G(z + αw*).
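To make this concrete, below is a minimal PyTorch-style sketch of the optimization in Eq. 1. The generator and the zoom edit are simplified stand-ins (a random linear "generator" and a crop-and-resize zoom), and the batch size, learning rate, and α range are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim_z = 128
G_weight = torch.randn(3 * 64 * 64, dim_z)       # stand-in "generator" weights

def G(z):
    # Dummy differentiable generator: (B, dim_z) -> (B, 3, 64, 64) images in [-1, 1].
    return torch.tanh(z @ G_weight.t()).view(-1, 3, 64, 64)

def edit_zoom(x, alpha):
    # alpha-times zoom-in implemented as a center crop followed by resize (alpha > 1).
    h, w = x.shape[2:]
    ch, cw = int(h / alpha), int(w / alpha)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = x[:, :, top:top + ch, left:left + cw]
    return F.interpolate(crop, size=(h, w), mode='bilinear', align_corners=False)

walk = torch.zeros(1, dim_z, requires_grad=True)  # the learned walk direction w
opt = torch.optim.Adam([walk], lr=1e-3)

for step in range(1000):
    z = torch.randn(16, dim_z)
    alpha_t = float(torch.empty(1).uniform_(1.0, 4.0))   # target zoom factor
    alpha_g = torch.log(torch.tensor(alpha_t))           # log-scaled walk step (cf. A.2)
    target = edit_zoom(G(z), alpha_t).detach()           # self-supervised target
    loss = F.mse_loss(G(z + alpha_g * walk), target)     # L2 objective of Eq. 1
    opt.zero_grad()
    loss.backward()
    opt.step()
```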

The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories, in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize

L = E_{z,n} [ ‖ f^n(z) − edit(G(z), nε) ‖ ]        (2)

where f^n(·) is an n-th-order function composition f(f(f(. . .))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.

Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.

We use this function composition approach rather than the simpler setup of G(z + αNN(z)) because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).

Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., “luminance”, in the dataset versus in images generated after walking in latent space.

For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
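A rough sketch of this measurement is shown below; the per-image pixel count follows the N = 100 described above, while the array layout and the luminance-as-mean-RGB shortcut are our assumptions.

```python
import numpy as np

def luminance_samples(images, n_pixels=100, seed=0):
    """images: array of shape (N, H, W, 3) with values in [0, 1].
    Returns sampled per-pixel luminance values (mean over RGB)."""
    rng = np.random.default_rng(seed)
    n, h, w, _ = images.shape
    ys = rng.integers(0, h, size=(n, n_pixels))
    xs = rng.integers(0, w, size=(n, n_pixels))
    pixels = images[np.arange(n)[:, None], ys, xs]   # (N, n_pixels, 3)
    return pixels.mean(axis=-1).ravel()              # luminance = mean RGB value

# Example: compare the luminance histogram before and after a (fake) latent step.
before = luminance_samples(np.random.rand(8, 128, 128, 3))
after = luminance_samples(np.clip(np.random.rand(8, 128, 128, 3) + 0.2, 0, 1))
hist_before, edges = np.histogram(before, bins=20, range=(0, 1), density=True)
hist_after, _ = np.histogram(after, bins=20, range=(0, 1), density=True)
```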

For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest probability bounding box corresponding to the desired class and use it to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box and normalize by image width or height, respectively.
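The statistics themselves are simple functions of a detected box; the sketch below abstracts away the detector and assumes a corner-format bounding box.

```python
from typing import NamedTuple, Tuple

class Box(NamedTuple):
    x_min: float   # pixel coordinates of the detected object
    y_min: float
    x_max: float
    y_max: float

def zoom_statistic(box: Box, img_w: int, img_h: int) -> float:
    """Normalized bounding box area, used to quantify the zoom transformation."""
    area = (box.x_max - box.x_min) * (box.y_max - box.y_min)
    return area / float(img_w * img_h)

def shift_statistics(box: Box, img_w: int, img_h: int) -> Tuple[float, float]:
    """Normalized box center (x, y), used to quantify shift X / shift Y."""
    cx = 0.5 * (box.x_min + box.x_max) / img_w
    cy = 0.5 * (box.y_min + box.y_max) / img_h
    return cx, cy

# Example with a hypothetical detection on a 256x256 image:
box = Box(64, 96, 192, 224)
print(zoom_statistic(box, 256, 256))    # 0.25
print(shift_statistics(box, 256, 256))  # (0.5, 0.625)
```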

Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.

4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.

4.1. What image transformations can we achieve by steering in the latent space?

We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.

We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point; in fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.

An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on the Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue; when we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors; therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.
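The per-class comparison can be reproduced with a standard two-sample t-test, as sketched below; the arrays are random placeholders standing in for the per-sample LPIPS distances of two classes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-sample perceptual distances for two classes (200 samples each).
dist_class_a = rng.normal(loc=0.6, scale=0.1, size=200)   # e.g., jellyfish
dist_class_b = rng.normal(loc=0.3, scale=0.1, size=200)   # e.g., goldfinch

t_stat, p_value = stats.ttest_ind(dist_class_a, dist_class_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```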

[Figure 4 panels: a redness walk on jellyfish vs. goldfinch, a darkness walk on volcano vs. alp, and a blueness walk on sports car vs. fire engine, each with per-class LPIPS perceptual distance boxplots]

Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., “volcanoes” explode, “Alps” do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.

Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.

Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic looking images.

For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.

[Figure 5 plots for Luminance, Shift X, Shift Y, and Zoom: transformed attribute distributions under varying α, intersection with G(z) vs. α, and FID vs. α]

Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.

Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).

We perform a similar procedure for the zoom operation by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.

4.2. How does the data affect the transformations?

How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?

One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for the zoom and shift transformations, respectively). We hypothesize that if the amount of transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

[Figure 6 scatterplots of dataset standard deviation vs. model ∆µ: Luminance (r = 0.59), Shift X (r = 0.28), Shift Y (r = 0.39), and Zoom (r = 0.37), all with p < 0.001; insets show zoom examples for the robin and laptop classes]

Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations in comparison to classes with a broad dataset distribution (e.g., laptop).

More concretely, we define the change in model means under a given transformation as

∆µ_k = ‖ µ_{k, model(+α*)} − µ_{k, model(−α*)} ‖_2        (3)

for a given class k, under the optimal walk w* for each transformation under Eq. 5, and we set α* to the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means µ_{k, model(α*)} and µ_{k, model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments we use T = 22).

We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
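The correlation itself is a one-line computation; the sketch below uses placeholder per-class values purely to illustrate the procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
num_classes = 100
# Placeholder per-class dataset spread and per-class model mean shift.
dataset_sigma = rng.uniform(0.05, 0.20, size=num_classes)
delta_mu = 2.0 * dataset_sigma + rng.normal(0, 0.02, size=num_classes)

r, p = stats.pearsonr(dataset_sigma, delta_mu)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```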

For the zoom transformation, we show examples of two extremes along the trend. We show an example of the “robin” class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the “laptop” class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.

From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

[Figure 7 panels: zoom from 0.125x to 8x under linear and non-linear walks with L2 and LPIPS objectives]

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.

4.3. Alternative architectures and walks

We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model ∆µ for linear and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.

             Luminance   Shift X   Shift Y   Zoom
Linear         0.59       0.28      0.39     0.37
Non-linear     0.49       0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.

To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

[Figure 8 panels: luminance distributions of transformed StyleGAN car samples under varying α, and color sliders for Luminance, Red, Green, and Blue]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to “steer” the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-T. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.

A.2. Implementation details for the linear walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit(G(z), αt). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
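A sketch of how such a shifted target and its validity mask might be constructed is shown below; the array layout, zero padding, and the αg = αt/D mapping in the final comment follow the description above, while everything else is an assumption.

```python
import numpy as np

def edit_shift_x(image, t):
    """Shift an (H, W, 3) image by t pixels along x (positive = right),
    filling the exposed region with zeros, and return (target, mask)."""
    h, w, _ = image.shape
    target = np.zeros_like(image)
    mask = np.zeros((h, w), dtype=bool)
    if t >= 0:
        target[:, t:] = image[:, :w - t]
        mask[:, t:] = True
    else:
        target[:, :w + t] = image[:, -t:]
        mask[:, :w + t] = True
    return target, mask

image = np.random.rand(128, 128, 3)
t = 32
target, mask = edit_shift_x(image, t)
alpha_g = t / image.shape[1]   # the walk step is the shift as a fraction of width (alpha_t / D)
```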

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
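As a complement to the crop-and-resize zoom-in sketched earlier, a possible zoom-out edit (shrink and paste onto a blank canvas, with a mask marking the visible region) could look like the sketch below; the nearest-neighbor downsampling and zero padding are our assumptions.

```python
import numpy as np

def edit_zoom_out(image, factor):
    """Zoom out an (H, W, 3) image by `factor` > 1: downsample it and paste it
    centered on a blank canvas. Returns (target, mask of visible pixels)."""
    h, w, _ = image.shape
    nh, nw = int(round(h / factor)), int(round(w / factor))
    ys = np.linspace(0, h - 1, nh).astype(int)   # nearest-neighbor downsample
    xs = np.linspace(0, w - 1, nw).astype(int)
    small = image[ys][:, xs]
    target = np.zeros_like(image)
    mask = np.zeros((h, w), dtype=bool)
    top, left = (h - nh) // 2, (w - nw) // 2
    target[top:top + nh, left:left + nw] = small
    mask[top:top + nh, left:left + nw] = True
    return target, mask

image = np.random.rand(128, 128, 3)
target, mask = edit_zoom_out(image, factor=2.0)
alpha_g = np.log(2.0)   # signed walk step: the log of the zoom factor, as in the text
```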

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
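A sketch of the color edit and the three-direction walk parametrization described above (tensor shapes and the clamping to [0, 1] are assumptions):

```python
import torch

def edit_color(x, alpha_rgb):
    """x: (B, 3, H, W) images in [0, 1]; alpha_rgb: length-3 tensor of per-channel offsets."""
    return torch.clamp(x + alpha_rgb.view(1, 3, 1, 1), 0.0, 1.0)

dim_z = 128
w_rgb = torch.zeros(3, dim_z, requires_grad=True)   # one learned walk direction per channel

def walk_color(z, alpha_rgb):
    """z + alpha_R * w_R + alpha_G * w_G + alpha_B * w_B."""
    return z + alpha_rgb @ w_rgb

z = torch.randn(4, dim_z)
alpha_rgb = torch.tensor([0.2, -0.1, 0.0])   # push toward red, slightly away from green
z_edit = walk_color(z, alpha_rgb)
```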

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with a −45 ≤ αt ≤ 45 degree rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with a −45 ≤ αt ≤ 45 degree rotation, scale αg = αt/R where R = 45, and apply a mask during training.

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network, w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that, for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 5px twice is the same as shifting by 10px; zooming by 2x twice is the same as zooming by 4x). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z1 = z + αF(z)
z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z1 + αF(z1)        (4)

Therefore,

z + 2αF(z) = z + αF(z) + αF(z1)  ⇒  αF(z) = αF(z1)  ⇒  F(z) = F(z1).

Thus F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
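A sketch of this parametrization is shown below; the layer sizes, the renormalization step, and the two separate networks for the positive and negative directions follow the description above, but the exact architecture and step count are assumptions.

```python
import torch
import torch.nn as nn

dim_z = 128

class EpsStep(nn.Module):
    """Two-layer network predicting a fixed epsilon-sized step in latent space."""
    def __init__(self, dim_z):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_z, dim_z), nn.ReLU(), nn.Linear(dim_z, dim_z))

    def forward(self, z):
        z_next = z + self.net(z)
        # Renormalize so the latent magnitude stays comparable to the prior.
        return z_next * (z.norm(dim=1, keepdim=True) / z_next.norm(dim=1, keepdim=True))

f_pos = EpsStep(dim_z)   # epsilon step in the positive direction
f_neg = EpsStep(dim_z)   # independent weights for the negative direction

def walk(z, n_steps):
    """Apply the epsilon-step network |n_steps| times (the sign picks the direction)."""
    f = f_pos if n_steps >= 0 else f_neg
    for _ in range(abs(n_steps)):
        z = f(z)
    return z

z = torch.randn(8, dim_z)
z_out = walk(z, n_steps=3)   # roughly a 3*epsilon transformation
```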

B. Additional experiments

B.1. Model and data distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying transformation limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
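One way to compute these consecutive distances is sketched below. It assumes the third-party lpips package and a generate(α) stand-in for sampling transformed images from the model; both are illustrative rather than the exact evaluation code.

```python
import torch
import lpips  # assumption: the pip-installable LPIPS package by Zhang et al.

loss_fn = lpips.LPIPS(net='alex')
alphas = [-1.0, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.0]

def generate(alpha):
    # Stand-in for G(z + alpha * w) on a fixed batch of latents; images in [-1, 1].
    torch.manual_seed(0)
    return torch.tanh(torch.randn(16, 3, 128, 128) + alpha)

consecutive = []
for a0, a1 in zip(alphas[:-1], alphas[1:]):
    with torch.no_grad():
        d = loss_fn(generate(a0), generate(a1))   # per-pair perceptual distances
    consecutive.append((a0, a1, d.mean().item()))
```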

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).

B.3. Detected bounding boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model.2 In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4. Alternative walks in BigGAN

B.4.1. LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z [ L(G(z + αw), edit(G(z), α)) ]        (5)

using a Euclidean loss for L. In Fig. 13 we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2. Non-linear walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

2 https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3. Additional qualitative examples

We show qualitative examples for randomly generated categories for BigGAN with the linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right) and for combining the zoom, shift X, and shift Y transformations (Fig. 26).

00 02 04 06 08 10Area

05

10

15

20

PD

F

105

data

model

00 02 04 06 08 10Center Y

0

1

2

PD

F

102

data

model

00 02 04 06 08 10Center X

00

05

10

15

20

PD

F

102

data

model

00 02 04 06 08 10Pixel Intensity

2

3

4

5

PD

F

103

data

model

ZoomShift YShift XLuminance

Figure 9 Comparing model versus dataset distribution We plot statistics of the generated under the color (luminance) zoom(object bounding box size) and shift operations (bounding box center) and compare them to the statistics of images in thetraining dataset

Luminance

Rotate 2D

Shift X Shift Y

Zoom

10 05 00 05 10crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

50 0 50crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

2 1 0 1 2log(crarr)

00

02

04

06

08

Per

ceptu

alD

ista

nce

Rotate 3D

500 250 0 250 500crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

Figure 10 LPIPS Perceptual distances between images generated from pairs of consecutive αi and αi+1 We sample 1000images from randomly selected categories using BigGAN transform them according to the learned linear trajectory for eachtransformation We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area) aswell as 20 individual samples (scatterplot) Because the Rotate 3D operation undershoots the targeted transformation weobserve more visible effects when we increase the α magnitude

Figure 11 Bounding boxes for random selected classes using ImageNet training images

Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

(Panel columns: Luminance, Shift X, Shift Y, Zoom, with reported correlation values 0.49, 0.49, 0.55, 0.60; cf. Table 1.)

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.


Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.


Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.


Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.


Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.


Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.


Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.


Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.


Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.


Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.


Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.


objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.

Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob – the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.

Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory, and therefore our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.

3. Method

Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, in which transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).


Figure 2: We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(...f(z)))) when we choose a non-linear walk.

Objective. How can we maneuver in the latent space to discover different transformations on a given image? We desire to learn a vector representing the optimal path for this latent space transformation, which we parametrize as an N-dimensional learned vector. We weight this vector by a continuous parameter α, which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function

w* = arg min_w  E_{z,α} [ L( G(z + αw), edit(G(z), α) ) ]        (1)

Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use the L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner – we perform our desired transformation on an arbitrary generated image, and subsequently we optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote this transformation in the direction of w*, controlled by the variable step size α, defined as model(α) = G(z + αw*).
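As a concrete illustration, here is a minimal PyTorch-style sketch of optimizing Eq. 1 for a linear walk. The generator G (a differentiable callable from latent codes to images) and the edit function edit_fn are assumed stand-ins supplied by the reader; this is not the authors' released implementation, which Appendix A.1 reports was written in TensorFlow.

import torch

def learn_linear_walk(G, edit_fn, dim_z, steps=2000, batch=8, alpha_max=1.0, lr=1e-3):
    # w is the learned walk direction; alpha is sampled per example as the step size.
    w = torch.zeros(1, dim_z, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, dim_z)
        alpha = (2 * torch.rand(batch, 1) - 1) * alpha_max
        with torch.no_grad():
            target = edit_fn(G(z), alpha)        # self-supervised target edit(G(z), alpha)
        out = G(z + alpha * w)                   # image generated after the latent step
        loss = ((out - target) ** 2).mean()      # L2 objective of Eq. 1
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()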

The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories, in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize

L = E_{z,n} [ || G(f^n(z)) − edit(G(z), nε) || ]        (2)

where f^n(·) is an n-th-order function composition f(f(f(...))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.
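A minimal sketch of this recursive composition and the loss of Eq. 2 is shown below. As before, G and edit_fn are assumed stand-ins, and the latent renormalization described in A.4 is omitted here for brevity.

import torch
import torch.nn as nn

class EpsilonStep(nn.Module):
    # f(z) = z + NN(z): one fixed epsilon-step in latent space.
    def __init__(self, dim_z, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_z, hidden), nn.ReLU(), nn.Linear(hidden, dim_z))

    def forward(self, z):
        return z + self.net(z)

def nonlinear_walk_loss(G, edit_fn, f, z, n, eps):
    # Apply f recursively n times (discrete Euler-style steps), then compare the generated
    # image against the target edited by n * eps, as in Eq. 2.
    zn = z
    for _ in range(n):
        zn = f(zn)
    with torch.no_grad():
        target = edit_fn(G(z), n * eps)
    return ((G(zn) - target) ** 2).mean()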


Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.

We use this function composition approach rather than the simpler setup of G(z + αNN(z)), because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).

Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., "luminance", in the dataset versus in images generated after walking in latent space.

For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
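A small NumPy sketch of this pixel-sampling procedure, assuming images are provided as H x W x 3 uint8 arrays (the function name is illustrative):

import numpy as np

def luminance_samples(images, n_pixels=100, seed=0):
    # Sample n_pixels random pixels per image and return their mean-RGB (luminance)
    # values, normalized to [0, 1].
    rng = np.random.default_rng(seed)
    vals = []
    for img in images:
        h, w, _ = img.shape
        ys = rng.integers(0, h, n_pixels)
        xs = rng.integers(0, w, n_pixels)
        vals.append(img[ys, xs].mean(axis=-1) / 255.0)
    return np.concatenate(vals)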

For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest-probability bounding box corresponding to the desired class and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box and normalize by image width or height.
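For instance, given a detected box in pixel coordinates (a hypothetical (xmin, ymin, xmax, ymax) tuple returned by whichever detector is used), the zoom and shift statistics reduce to simple normalized quantities:

def box_statistics(box, img_w, img_h):
    # box: (xmin, ymin, xmax, ymax) for the highest-probability detection of the class.
    xmin, ymin, xmax, ymax = box
    area = (xmax - xmin) * (ymax - ymin) / float(img_w * img_h)   # zoom statistic
    center_x = 0.5 * (xmin + xmax) / float(img_w)                 # shift X statistic
    center_y = 0.5 * (ymin + ymax) / float(img_h)                 # shift Y statistic
    return area, center_x, center_y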

Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model, and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.

4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.

4.1. What image transformations can we achieve by steering in the latent space?

We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.

We further ask the question: how much are we able to achieve each transformation? When we increase the step size α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point. In fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.

An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited in individual classes of the dataset.


Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "Volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.

Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.

Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic-looking images.

For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.


Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.

Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 and 3).
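The intersection area used here can be computed directly from normalized histograms of the two attribute distributions; a minimal NumPy sketch with illustrative names:

import numpy as np

def intersection_area(samples_a, samples_b, bins=50, value_range=(0.0, 1.0)):
    # Density-normalized histograms integrate to 1; their pointwise minimum, summed over
    # bins and scaled by the bin width, is the shared probability mass of the two distributions.
    hist_a, edges = np.histogram(samples_a, bins=bins, range=value_range, density=True)
    hist_b, _ = np.histogram(samples_b, bins=bins, range=value_range, density=True)
    bin_width = edges[1] - edges[0]
    return float(np.minimum(hist_a, hist_b).sum() * bin_width)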

We perform a similar procedure for the zoom operation, measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in the α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is asymmetric, indicating that the success with which we are able to achieve the zoom-in operation under the constraint of realistic-looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation and risk deviating from the natural image manifold.

4.2. How does the data affect the transformations?

How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?

One intuitive way of quantifying how the amount of transformation achieved depends on the variability in the dataset is to measure the difference between the transformed model means model(+α) and model(−α) and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively).


Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).

We hypothesize that if the amount of transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as

∆µ_k = || µ_{k, model(+α*)} − µ_{k, model(−α*)} ||_2        (3)

for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means of µ_{k, model(α*)} and µ_{k, model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments we use T = 22).
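A sketch of this per-class measurement, assuming attribute samples have already been collected under the +α* and −α* walks and from the dataset (dictionary names are illustrative); for a scalar attribute, the L2 norm in Eq. 3 reduces to an absolute difference:

import numpy as np
from scipy import stats

def per_class_shift_and_spread(model_pos, model_neg, dataset_vals):
    # model_pos / model_neg: dict mapping class -> attribute samples under the +alpha* / -alpha* walk.
    # dataset_vals: dict mapping class -> attribute samples from the training data.
    delta_mu, sigma = [], []
    for k in dataset_vals:
        delta_mu.append(abs(np.mean(model_pos[k]) - np.mean(model_neg[k])))  # Eq. 3, scalar case
        sigma.append(np.std(dataset_vals[k]))                                # dataset spread
    r, p = stats.pearsonr(sigma, delta_mu)   # correlation between spread and achieved shift
    return np.array(sigma), np.array(delta_mu), r, p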

We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).

For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying the +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.

From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset.

(Panels: Linear L2, Linear LPIPS, Non-linear L2, Non-linear LPIPS; zoom levels from 0.125x to 8x.)

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.

Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

4.3. Alternative architectures and walks

We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficients between dataset σ and model ∆µ for linear and nonlinear walks are shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.

             Luminance   Shift X   Shift Y   Zoom
Linear         0.59        0.28      0.39     0.37
Non-linear     0.49        0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.

To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them.


Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.

This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.

A.2. Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g w) and the target edit edit(G(z), α_t). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between -1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t / D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
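A minimal NumPy sketch of how such a shifted target and its loss mask could be constructed for a horizontal shift (illustrative, operating on a single H x W x 3 array):

import numpy as np

def shift_x_target(img, t):
    # t: signed shift in pixels. Returns (target, mask); the mask marks pixels copied from
    # the source image, so the loss is applied only on the visible portion.
    h, w, _ = img.shape
    target = np.zeros_like(img)
    mask = np.zeros((h, w, 1), dtype=img.dtype)
    if t >= 0:
        target[:, t:w] = img[:, 0:w - t]
        mask[:, t:w] = 1
    else:
        target[:, 0:w + t] = img[:, -t:w]
        mask[:, 0:w + t] = 1
    return target, mask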

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
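A corresponding sketch of the zoom target under this convention, assuming OpenCV is available for resizing (the function name is illustrative):

import numpy as np
import cv2

def zoom_target(img, alpha_t):
    # 0.25 < alpha_t < 1: crop the central alpha_t portion and resize back up (zoom in).
    # 1 < alpha_t < 4: downsample by alpha_t and pad; the padded region is masked out of the loss.
    h, w, _ = img.shape
    if alpha_t < 1.0:
        ch, cw = int(round(h * alpha_t)), int(round(w * alpha_t))
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        crop = img[y0:y0 + ch, x0:x0 + cw]
        return cv2.resize(crop, (w, h)), np.ones((h, w, 1), dtype=img.dtype)
    sh, sw = int(round(h / alpha_t)), int(round(w / alpha_t))
    small = cv2.resize(img, (sw, sh))
    target = np.zeros_like(img)
    mask = np.zeros((h, w, 1), dtype=img.dtype)
    y0, x0 = (h - sh) // 2, (w - sw) // 2
    target[y0:y0 + sh, x0:x0 + sw] = small
    mask[y0:y0 + sh, x0:x0 + sw] = 1
    return target, mask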

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g w = z + α_R w_R + α_G w_G + α_B w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.
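For illustration, a PyTorch-style sketch of the per-channel edit and the jointly parameterized three-direction walk (tensor shapes are assumptions):

import torch

def color_edit(images, alpha_rgb):
    # images: (B, 3, H, W) in [0, 1]; alpha_rgb: (B, 3) per-channel offsets in [-0.5, 0.5].
    return torch.clamp(images + alpha_rgb[:, :, None, None], 0.0, 1.0)

def color_walk(z, alpha_rgb, w_r, w_g, w_b):
    # z + alpha_R * w_R + alpha_G * w_G + alpha_B * w_B, with the three directions learned jointly.
    return (z + alpha_rgb[:, 0:1] * w_r
              + alpha_rgb[:, 1:2] * w_g
              + alpha_rgb[:, 2:3] * w_b)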

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degree rotations. Using R = 45, we scale α_g = α_t / R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degree rotations, scale α_g = α_t / R where R = 45, and apply a mask during training.
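A sketch of how the 2D rotation target and its visibility mask could be produced, assuming OpenCV; the 3D billboard case would analogously use cv2.getPerspectiveTransform and cv2.warpPerspective:

import numpy as np
import cv2

def rotate2d_target(img, angle_deg):
    # Rotate about the image center; the mask marks pixels that remain inside the frame,
    # so the loss is enforced only on visible regions of the target.
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    target = cv2.warpAffine(img, M, (w, h))
    mask = cv2.warpAffine(np.ones((h, w), dtype=np.float32), M, (w, h))
    return target, mask[..., None]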

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z [ L( G(z + αw), edit(G(z), α) ) ], where α is an arbitrary scalar value. Note that, for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z_1 = z + αF(z)
z_2 = z_1 + αF(z_1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z_1 + αF(z_1)        (4)

Therefore,

z + 2αF(z) = z + αF(z) + αF(z_1)  ⇒  αF(z) = αF(z_1)  ⇒  F(z) = F(z_1).

Thus F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε dz/dt |_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
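A minimal PyTorch sketch of one such ε-step module follows. The renormalization target is not fully specified above, so rescaling the stepped code back to the input code's norm is an assumption made here; separate instances would be trained for the positive and negative directions.

import torch
import torch.nn as nn

class NonlinearWalk(nn.Module):
    # One fixed epsilon-step: z -> renormalize(z + NN(z)), with NN a two-layer network.
    def __init__(self, dim_z, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_z, hidden), nn.ReLU(), nn.Linear(hidden, dim_z))

    def forward(self, z):
        z_next = z + self.net(z)
        scale = z.norm(dim=1, keepdim=True) / (z_next.norm(dim=1, keepdim=True) + 1e-8)
        return z_next * scale

def apply_walk(f, z, n_steps):
    # Larger transformations are reached by composing the fixed step n_steps times.
    for _ in range(n_steps):
        z = f(z)
    return z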

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For shift and zoom transformations, the perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
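Such consecutive-step distances could be computed, for example, with the lpips Python package (an assumption on tooling; the package implements the metric of [25]):

import torch
import lpips

def consecutive_perceptual_distances(images_by_alpha):
    # images_by_alpha: list of (B, 3, H, W) tensors scaled to [-1, 1], ordered by increasing alpha.
    # Returns the mean LPIPS distance between images generated from consecutive alpha_i, alpha_{i+1}.
    loss_fn = lpips.LPIPS(net='alex')
    dists = []
    with torch.no_grad():
        for a, b in zip(images_by_alpha[:-1], images_by_alpha[1:]):
            dists.append(loss_fn(a, b).mean().item())
    return dists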

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model (https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API). In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4. Alternative Walks in BigGAN

B.4.1. LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z [ L( G(z + αw), edit(G(z), α) ) ]        (5)

using a Euclidean loss for L. In Fig. 13 we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and a learning rate of 0.001 for zoom and color and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2. Non-linear Walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.


B.4.3. Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear-LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.


- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center


[Figure 3 panels: − Zoom +, − Shift Y +, − Rotate 2D +, each annotated with average LPIPS perceptual distances near the transformation limits]

Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.

We use this function composition approach rather than the simpler setup of G(z + αNN(z)) because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).

Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., "luminance", in the dataset versus in images generated after walking in latent space.

For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
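As a concrete illustration, the following is a minimal sketch of this pixel-sampling step, assuming generated images arrive as float arrays of shape (B, H, W, 3) with values in [0, 1]; the generator call in the usage comment is a placeholder, not the released implementation.

    import numpy as np

    def sample_attribute_pixels(images, n_pixels=100, attribute="luminance", seed=0):
        """Randomly sample pixels from each image and return attribute values in [0, 1].

        images: array of shape (B, H, W, 3) with values in [0, 1].
        attribute: "luminance" uses the mean over RGB; "R"/"G"/"B" picks one channel.
        """
        rng = np.random.default_rng(seed)
        b, h, w, _ = images.shape
        ys = rng.integers(0, h, size=(b, n_pixels))
        xs = rng.integers(0, w, size=(b, n_pixels))
        pixels = images[np.arange(b)[:, None], ys, xs]        # (B, n_pixels, 3)
        if attribute == "luminance":
            values = pixels.mean(axis=-1)                      # mean RGB value
        else:
            values = pixels[..., {"R": 0, "G": 1, "B": 2}[attribute]]
        return values.reshape(-1)

    # Usage (hypothetical generator call):
    # vals_before = sample_attribute_pixels(G(z))
    # vals_after  = sample_attribute_pixels(G(z + alpha * w))
    # hist, _ = np.histogram(vals_after, bins=50, range=(0, 1), density=True)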

For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest probability bounding box corresponding to the desired class and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box and normalize by image width or height.
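A small sketch of how a single detection can be converted into the zoom and shift statistics described above; the (xmin, ymin, xmax, ymax) box format in pixel coordinates is an assumption about the detector output.

    def box_statistics(box, image_width, image_height):
        """Convert a detected bounding box into the zoom and shift statistics.

        box: (xmin, ymin, xmax, ymax) in pixel coordinates for the highest-probability
        detection of the target class. Returns the normalized box area (zoom statistic)
        and the normalized box center (shift X / shift Y statistics).
        """
        xmin, ymin, xmax, ymax = box
        area = (xmax - xmin) * (ymax - ymin) / float(image_width * image_height)
        center_x = 0.5 * (xmin + xmax) / float(image_width)
        center_y = 0.5 * (ymin + ymax) / float(image_height)
        return area, center_x, center_y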

Truncation parameters in GANs (as used in [4, 13]) trade off between variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.

4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.

4.1. What image transformations can we achieve by steering in the latent space?

We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists on the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.

We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point. In fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.

An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano to a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars to blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.

[Figure 4 panels: − Redness + (jellyfish / goldfinch), − Darkness + (volcano / alp), − Blueness + (sports car / fire engine); boxplots of LPIPS perceptual distance per class]

Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "Volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.

Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.

Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic looking images.

For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.

[Figure 5 panels (columns: Luminance, Shift X, Shift Y, Zoom): attribute PDFs under the raw model output and under various α; intersection area vs. α; FID vs. α]

Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.

Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
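The intersection measure used here can be approximated from normalized histograms over a shared binning; the sketch below assumes attribute values already normalized to [0, 1] and an arbitrary bin count.

    import numpy as np

    def histogram_intersection(samples_a, samples_b, bins=50, value_range=(0.0, 1.0)):
        """Approximate the intersection area between two 1-D attribute distributions.

        Both sample sets are binned over the same range; the intersection of the
        normalized histograms lies in [0, 1], with 1 meaning identical distributions.
        """
        p, edges = np.histogram(samples_a, bins=bins, range=value_range, density=True)
        q, _ = np.histogram(samples_b, bins=bins, range=value_range, density=True)
        bin_width = edges[1] - edges[0]
        return float(np.sum(np.minimum(p, q)) * bin_width)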

We perform a similar procedure for the zoom operation by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.

4.2. How does the data affect the transformations?

How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?

One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

[Figure 6 panels: scatterplots of model Δμ vs. dataset σ for Luminance (r = 0.59), Shift X (r = 0.28), Shift Y (r = 0.39), and Zoom (r = 0.37), all with p < 0.001; example zoom distributions for the "robin" and "laptop" classes]

Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).

More concretely, we define the change in model means under a given transformation as

    Δμk = ‖ μk,model(+α*) − μk,model(−α*) ‖2    (3)

for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means of μk,model(+α*) and μk,model(−α*) under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).

We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model Δμ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of Δμ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
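A minimal sketch of this per-class analysis, assuming the attribute samples have already been collected into per-class arrays; for a scalar statistic, the L2 norm in Eq. 3 reduces to an absolute difference of means.

    import numpy as np
    from scipy.stats import pearsonr

    def per_class_shift_and_spread(model_pos, model_neg, dataset_vals):
        """Compute the quantities plotted in Fig. 6 for a set of classes.

        model_pos / model_neg: dict class -> 1-D array of an attribute (e.g., bounding
            box area) for samples transformed with +alpha* / -alpha*.
        dataset_vals: dict class -> 1-D array of the same attribute in the real data.
        Returns per-class dataset sigma, per-class model delta-mu, and their Pearson r.
        """
        classes = sorted(dataset_vals)
        sigma = np.array([dataset_vals[k].std() for k in classes])
        delta_mu = np.array([abs(model_pos[k].mean() - model_neg[k].mean())
                             for k in classes])
        r, p = pearsonr(sigma, delta_mu)
        return sigma, delta_mu, r, p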

For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation Δμ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.

From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

[Figure 7 panels: − Zoom + under the linear (L2 and LPIPS) and non-linear (L2 and LPIPS) walks, at target zoom levels 0.125x to 8x]

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.

4.3. Alternative architectures and walks

We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model Δμ for linear and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.

               Luminance   Shift X   Shift Y   Zoom
    Linear       0.59       0.28      0.39     0.37
    Non-linear   0.49       0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model Δμ for measured attributes; the p-value for the slope is < 0.001 for all transformations.

To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano to an active one, or a daytime scene to a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

[Figure 8 panels: luminance PDFs under various α for the StyleGAN cars generator; qualitative examples for − Luminance +, − Red +, − Green +, − Blue + transformations]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
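For illustration, a minimal TensorFlow 2-style sketch of this optimization (Eq. 5); `generator` and `edit_fn` are placeholders for the pretrained (class-conditional) GAN and the target edit, the latent dimension and α sampling range are assumptions, and the visibility mask used for shift/zoom is omitted for brevity. This is a sketch of the idea, not the released implementation.

    import tensorflow as tf

    latent_dim = 128                                   # assumption; depends on the model
    w = tf.Variable(tf.random.normal([1, latent_dim], stddev=0.02), name="walk")
    opt = tf.keras.optimizers.Adam(1e-3)

    def train_step(generator, edit_fn, batch_size=4, alpha_range=1.0):
        z = tf.random.normal([batch_size, latent_dim])
        alpha = tf.random.uniform([batch_size, 1], -alpha_range, alpha_range)
        target = edit_fn(generator(z), alpha)          # edit(G(z), alpha), held fixed
        with tf.GradientTape() as tape:
            steered = generator(z + alpha * w)         # G(z + alpha * w)
            loss = tf.reduce_mean(tf.square(steered - target))   # L2 objective
        grads = tape.gradient(loss, [w])
        opt.apply_gradients(zip(grads, [w]))
        return loss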

A.2. Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg w) and the target edit edit(G(z), αt). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
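A sketch of how the shifted target and its visibility mask might be constructed, assuming images are numpy arrays of shape (H, W, 3); this illustrates the masking idea rather than the authors' code.

    import numpy as np

    def shift_with_mask(image, shift_px, axis=1):
        """Create the shifted target image and a visibility mask for the L2 loss.

        image: (H, W, 3) array; shift_px: signed pixel shift along `axis`
        (1 = horizontal X, 0 = vertical Y). Regions shifted in from outside the
        frame are zero-filled and masked out, so the loss only covers the visible
        portion of the source image.
        """
        target = np.zeros_like(image)
        mask = np.zeros(image.shape[:2], dtype=bool)
        size = image.shape[axis]
        s = int(shift_px)
        if s >= 0:
            src, dst = slice(0, size - s), slice(s, size)
        else:
            src, dst = slice(-s, size), slice(0, size + s)
        if axis == 1:
            target[:, dst] = image[:, src]
            mask[:, dst] = True
        else:
            target[dst, :] = image[src, :]
            mask[dst, :] = True
        return target, mask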

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
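A corresponding sketch of the zoom edit under the convention above, assuming uint8 RGB arrays and PIL for resizing; the interpolation method is an assumption.

    import numpy as np
    from PIL import Image

    def zoom_edit(image, alpha_t):
        """Target edit for the zoom walk, following the convention above:
        0.25 < alpha_t < 1 crops the central alpha_t fraction and resizes back (zoom in);
        1 < alpha_t < 4 downsamples by alpha_t and pastes onto a zero canvas (zoom out)."""
        h, w, _ = image.shape
        if alpha_t < 1.0:                                   # zoom in
            ch, cw = int(round(h * alpha_t)), int(round(w * alpha_t))
            top, left = (h - ch) // 2, (w - cw) // 2
            crop = image[top:top + ch, left:left + cw]
            out = np.asarray(Image.fromarray(crop).resize((w, h), Image.BILINEAR))
            mask = np.ones((h, w), dtype=bool)
        else:                                               # zoom out
            nh, nw = max(1, int(round(h / alpha_t))), max(1, int(round(w / alpha_t)))
            small = np.asarray(Image.fromarray(image).resize((nw, nh), Image.BILINEAR))
            out = np.zeros_like(image)
            mask = np.zeros((h, w), dtype=bool)
            top, left = (h - nh) // 2, (w - nw) // 2
            out[top:top + nh, left:left + nw] = small
            mask[top:top + nh, left:left + nw] = True
        return out, mask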

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] in training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg w = z + αR wR + αG wG + αB wB, where we jointly learn the three walk directions wR, wG, and wB.
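A minimal sketch of the color edit and the corresponding three-direction latent step; the image convention (floats in [0, 1]) is an assumption.

    import numpy as np

    def color_edit(image, alpha_rgb):
        """Target edit for the color walk: add per-channel offsets (alpha_R, alpha_G,
        alpha_B) to an image in [0, 1] and clip back to the valid range."""
        return np.clip(image + np.asarray(alpha_rgb).reshape(1, 1, 3), 0.0, 1.0)

    def color_walk(z, alpha_rgb, w_r, w_g, w_b):
        """Latent step z + alpha_R*w_R + alpha_G*w_G + alpha_B*w_B for the jointly
        learned color directions."""
        a_r, a_g, a_b = alpha_rgb
        return z + a_r * w_r + a_g * w_g + a_b * w_b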

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotation, we scale αg = αt/R where R = 45, and apply a mask during training.

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

    edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target starting from an initial z, we can take two steps of size α in latent space as follows:

    z1 = z + αF(z)
    z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

    z + 2αF(z) = z1 + αF(z1)    (4)

Therefore:

    z + 2αF(z) = z + αF(z) + αF(z1)
    ⇒ αF(z) = αF(z1)
    ⇒ F(z) = F(z1)

Thus F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε (dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
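A sketch of how such an ε-step network might be applied, in TensorFlow 2 style; the hidden width and the interpretation of "renormalize the magnitude of z" as restoring the original norm are assumptions.

    import tensorflow as tf

    def make_step_network(latent_dim=128, hidden=512):
        """A two-layer MLP that predicts a fixed epsilon-step displacement in latent
        space (layer sizes are assumptions; the text only states a two-layer network)."""
        return tf.keras.Sequential([
            tf.keras.layers.Dense(hidden, activation="relu", input_shape=(latent_dim,)),
            tf.keras.layers.Dense(latent_dim),
        ])

    def nonlinear_walk(z, step_net, n_steps):
        """Apply n_steps composed epsilon-steps: z <- z + NN(z), renormalizing the
        magnitude of z after each step (interpreted here as restoring the original norm)."""
        norm = tf.norm(z, axis=-1, keepdims=True)
        for _ in range(n_steps):
            z = z + step_net(z)
            z = z / tf.norm(z, axis=-1, keepdims=True) * norm
        return z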

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance between images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
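A sketch of the consecutive-pair LPIPS measurement, assuming the `lpips` Python package (PyTorch-based) and images as tensors in [-1, 1] with shape (N, 3, H, W).

    import torch
    import lpips  # pip package "lpips"

    loss_fn = lpips.LPIPS(net="alex")

    def consecutive_lpips(images_per_alpha):
        """Mean LPIPS distance between images generated at consecutive alpha values.

        images_per_alpha: list (ordered by alpha) of tensors of shape (N, 3, H, W),
        where row i of each tensor comes from the same latent sample z.
        """
        dists = []
        for a, b in zip(images_per_alpha[:-1], images_per_alpha[1:]):
            with torch.no_grad():
                d = loss_fn(a, b)          # per-sample distances, shape (N, 1, 1, 1)
            dists.append(d.mean().item())
        return dists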

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12 we show the results of applying the object detection model to images from the dataset and images generated by the model under the zoom, horizontal shift, and vertical shift transformations, for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4. Alternative Walks in BigGAN

B.4.1. LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

    J(w, α) = E_z [L(G(z + αw), edit(G(z), α))]    (5)

using a Euclidean loss for L. In Fig. 13 we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to scaling of α).

B.4.2. Non-linear Walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3. Additional Qualitative Examples

We show qualitative examples for randomly generated categories for the BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).

[Figure 9 panels (Luminance, Shift X, Shift Y, Zoom): PDFs of pixel intensity, bounding box center X / center Y, and bounding box area for the dataset versus the model]

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.

[Figure 10 panels: LPIPS perceptual distance vs. α for Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, and Zoom (log(α))]

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.

Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes, using model-generated images for zoom and horizontal and vertical shift transformations under random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

[Figure 15 panels (Luminance, Shift X, Shift Y, Zoom): attribute PDFs under various α, intersection vs. α, FID vs. α, and scatterplots of dataset σ vs. model Δμ annotated with 0.49, 0.49, 0.55, 0.60]

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.

[Figure 16 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.

[Figure 17 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.

[Figure 18 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.

[Figure 19 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.

[Figure 20 panels (Luminance, Shift X, Shift Y, Zoom): attribute PDFs for the dataset, the model, and the model under various α; intersection vs. α; FID vs. α]

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.

[Figure 21 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.

[Figure 22 panels (Luminance, Shift X, Shift Y, Zoom): attribute PDFs for the dataset, the model, and the model under various α; intersection vs. α; FID vs. α]

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.

[Figure 23 panels: − Zoom +, − Color +, − Shift X +, − Shift Y +, − Rotate 2D +, − Rotate 3D +]

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.

[Figure 24 panels (Luminance, Shift X, Shift Y, Zoom): attribute PDFs for the dataset, the model, and the model under various α; intersection vs. α; FID vs. α]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.

[Figure 25 panels: − Color + (left), − Contrast + (right)]

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in the left column, and a contrast walk for both BigGAN and StyleGAN in the right column.

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.

Page 5: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

source image (shown in the center image of each transfor-mation) Interestingly several priors come into play whenlearning these image transformations When we shift adaisy downwards in the Y direction the model hallucinatesthat the sky exists on the top of the image However whenwe shift the daisy up the model inpaints the remainder ofthe image with grass When we alter the brightness of a im-age the model transitions between nighttime and daytimeThis suggests that the model is able to extrapolate from theoriginal source image and yet remain consistent with theimage context

We further ask the question how much are we able toachieve each transformation When we increase the stepsize of α we observe that the extent to which we canachieve each transformation is limited In Fig 3 we ob-serve two potential failure cases one in which the the im-age becomes unrealistic and the other in which the imagefails to transform any further When we try to zoom in ona Persian cat we observe that the cat no longer increases insize beyond some point In fact the size of the cat underthe latent space transformation consistently undershoots thetarget zoom On the other hand when we try to zoom out onthe cat we observe that it begins to fall off the image man-ifold and at some point is unable to become any smallerIndeed the perceptual distance (using LPIPS) between im-ages decreases as we push α towards the transformationlimits Similar trends hold with other transformations weare able to shift a lorikeet up and down to some degree untilthe transformation fails to have any further effect and yieldsunrealistic output and despite adjusting α on the rotationvector we are unable to rotate a pizza Are the limitationsto these transformations governed by the training datasetIn other words are our latent space walks limited becausein ImageNet photos the cats are mostly centered and takenwithin a certain size We seek to investigate and quantifythese biases in the next sections

An intriguing property of the learned trajectory is thatthe degree to which it affects the output depends on the im-age class In Fig 4 we investigate the impact of the walkfor different image categories under color transformationsBy stepping w in the direction of increasing redness in theimage we are able to successfully recolor a jellyfish but weare unable to change the color of a goldfinch ndash the goldfinchremains yellow and the only effect of w is to slightly red-den the background Likewise increasing the brightness ofa scene transforms an erupting volcano to a dormant vol-cano but does not have such effect on Alps which onlytransitions between night and day In the third example weuse our latent walk to turn red sports cars to blue When weapply this vector to firetrucks the only effect is to slightlychange the color of the sky Again perceptual distance overmultiple image samples per class confirms these qualitativeobservations a 2-sample t-test yields t = 2077 p lt 0001

for jellyfishgoldfinch t = 814 p lt 0001 for volcanoalpand t = 684 p lt 0001 for sports carfire engine Wehypothesize that the differential impact of the same trans-formation on various image classes relates to the variabilityin the underlying dataset Firetrucks only appear in redbut sports cars appear in a variety of colors Therefore ourcolor transformation is constrained by the biases exhibitedin individual classes in the dataset

- Blueness +

- Darkness +

- Redness +jellyfish goldfinch

00

05

10

Per

ceptu

alD

ista

nce

sports car fire engine00

02

04

Per

ceptu

alD

ista

nce

volcano alp00

05

Per

ceptu

alD

ista

nce

Figure 4 Each pair of rows shows how a single latent direc-tion w affects two different ImageNet classes We observethat the same direction affects different classes differentlyand in a way consistent with semantic priors (eg ldquoVolca-noesrdquo explode ldquoAlpsrdquo do not) Boxplots show the LPIPSperceptual distance before and after transformation for 200samples per class

Motivated by our qualitative observations in Fig 1Fig 3 and Fig 4 we next quantitatively investigate thedegree to which we are able to achieve certain transforma-tions

Limits of Transformations In quantifying the extent ofour transformations we are bounded by two opposing fac-tors how much we can transform the distribution while alsoremaining within the manifold of realistic looking images

For the color (luminance) transformation we observethat by changing α we shift the proportion of light anddark colored pixels and can do so without large increases inFID (Fig 5 first column) However the extent to which wecan move the distribution is limited and each transformeddistribution still retains some probability mass over the fullrange of pixel intensities

[Figure 5 panels, one column per transformation (Luminance, Shift X, Shift Y, Zoom): attribute PDFs (pixel intensity, center X, center Y, bounding-box area) under varying α, intersection with the G(z) distribution as a function of α, and FID as a function of α (log α for zoom).]

Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID of the transformed images to limit our transformations to the natural image manifold.

Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
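As an illustrative sketch of this intersection statistic (not the released implementation), the area of overlap between an attribute histogram under G(z) and under model(α) can be computed as follows; the bin count and value range below are assumptions of this example.

```python
import numpy as np

def histogram_intersection(samples_a, samples_b, bins=50, value_range=(0.0, 1.0)):
    """Area of overlap between two empirical attribute distributions.

    samples_a, samples_b: 1-D arrays of a per-image statistic, e.g. the detected
    bounding-box center X as a fraction of image width. Returns a value in [0, 1],
    where 1 means the two histograms coincide exactly.
    """
    p, edges = np.histogram(samples_a, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(samples_b, bins=bins, range=value_range, density=True)
    bin_width = edges[1] - edges[0]
    return float(np.sum(np.minimum(p, q)) * bin_width)

# Toy usage: object centers under G(z) versus centers after a latent shift (synthetic numbers).
centers_model = np.clip(np.random.normal(0.5, 0.05, size=1000), 0, 1)
centers_shifted = np.clip(np.random.normal(0.65, 0.07, size=1000), 0, 1)
print(histogram_intersection(centers_model, centers_shifted))
```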

We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. As with shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is asymmetric, indicating that the degree to which we are able to achieve the zoom-in operation under the constraint of realistic-looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation and risk deviating from the natural image manifold.

4.2. How does the data affect the transformations?

How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as observed in Fig. 4, due to limited variability in the underlying dataset for each class?

One intuitive way of quantifying the amount of transformation achieved, relative to the variability in the dataset, is to measure the difference between the transformed model means model(+α) and model(−α) and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding-box area and center value for zoom and shift transformations, respectively).

[Figure 6 scatterplots of model ∆µ versus dataset σ per class: Zoom (r = 0.37), Shift Y (r = 0.39), Shift X (r = 0.28), Luminance (r = 0.59), all p < 0.001; plus zoom examples and bounding-box area distributions for the "robin" and "laptop" classes.]

Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).

We hypothesize that if the amount of transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as

∆µ_k = ‖µ_k,model(+α*) − µ_k,model(−α*)‖_2    (3)

for a given class k, under the optimal walk w* for each transformation (Eq. 5). We set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means µ_k,model(+α*) and µ_k,model(−α*) under transformation, but also for which the FID of the generated distribution remains below a threshold T for generating reasonably realistic images (for our experiments we use T = 22).
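A minimal sketch of this measurement, assuming the per-class attribute samples have already been collected, is shown below; the use of NumPy/SciPy and the placeholder arrays are assumptions of this example, not part of the released code.

```python
import numpy as np
from scipy import stats

def delta_mu(stats_pos, stats_neg):
    """Eq. 3 for one class k: ||mean(+alpha*) - mean(-alpha*)||_2 of the measured
    attribute (a scalar such as bounding-box area, or a 3-vector for pixel RGB)."""
    diff = np.mean(stats_pos, axis=0) - np.mean(stats_neg, axis=0)
    return float(np.linalg.norm(diff))

# Per-class quantities over e.g. 100 sampled ImageNet classes:
#   dataset_sigma[k] - std. of the attribute over dataset images of class k
#   model_delta[k]   - delta_mu for class k from 200 generated samples per direction
dataset_sigma = np.random.uniform(0.05, 0.30, size=100)                  # placeholder values
model_delta = 0.8 * dataset_sigma + np.random.normal(0, 0.02, size=100)  # placeholder values

r, p = stats.pearsonr(dataset_sigma, model_delta)   # trend of the kind reported in Fig. 6
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```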

We apply this measure in Fig. 6: on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).

For the zoom transformation, we show examples of two extremes along the trend. In the "robin" class, the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying the +α* and −α* transformations is limited. On the other hand, in the "laptop" class the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.

From these results we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

[Figure 7 panels: zoom levels from 0.125x to 8x for Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS walks.]

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.

4.3. Alternative architectures and walks

We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space and obtain similar quantitative results. To summarize, the Pearson correlation coefficients between dataset σ and model ∆µ for linear and nonlinear walks are shown in Table 1, with full results in Sec. B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.

             Luminance  Shift X  Shift Y  Zoom
Linear         0.59      0.28     0.39    0.37
Non-linear     0.49      0.49     0.55    0.60

Table 1: Pearson correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.

To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and Sec. B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged; in other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one or a daytime scene into a nighttime one.

[Figure 8 panels: luminance PDFs under varying α for the StyleGAN cars generator, and qualitative color examples (- Luminance +, - Red +, - Green +, - Blue +).]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.

StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model exhibits characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements
We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References
[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-T. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
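The released code is in TensorFlow; the snippet below is a schematic PyTorch-style sketch of this optimization, assuming a frozen generator G and an edit() function with the interface described in Sec. A.2. The batch size and iteration count are illustrative choices of this sketch.

```python
import torch
import torch.nn.functional as F

def train_linear_walk(G, edit, z_dim, steps=5000, batch_size=4, lr=1e-3, device="cuda"):
    """Schematic optimization of a single walk vector w under the objective of Eq. 5 (L2 loss).

    G:    frozen pretrained generator, G(z) -> image batch
    edit: target transform, edit(images, alpha_g) -> (edited images, visibility mask),
          converting alpha_g to the pixel-space alpha_t internally (see Sec. A.2)
    """
    w = torch.zeros(1, z_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch_size, z_dim, device=device)
        alpha_g = 2 * torch.rand(batch_size, 1, device=device) - 1   # alpha_g in [-1, 1]
        with torch.no_grad():
            target, mask = edit(G(z), alpha_g)                       # target carries no gradient
        out = G(z + alpha_g * w)                                     # steered output
        loss = F.mse_loss(out * mask, target * mask)                 # masked L2 loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```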

A.2. Implementation Details for the Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g·w) and the target edit edit(G(z), α_t). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
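A minimal NumPy sketch of this masked shift edit is given below; the exact padding and interpolation choices of the released code may differ.

```python
import numpy as np

def shift_x_edit(img, alpha_t):
    """Horizontal shift of the target by alpha_t pixels (negative = left, positive = right).

    img: H x W x C array in [0, 1]. Returns the shifted image and a mask that is 1 on
    pixels coming from the source image and 0 on the newly exposed region, so the loss
    is only applied to the visible portion.
    """
    h, w, _ = img.shape
    shifted = np.zeros_like(img)
    mask = np.zeros((h, w, 1), dtype=img.dtype)
    t = int(round(alpha_t))
    if t >= 0:
        shifted[:, t:] = img[:, :w - t]
        mask[:, t:] = 1.0
    else:
        shifted[:, :w + t] = img[:, -t:]
        mask[:, :w + t] = 1.0
    return shifted, mask

# The corresponding latent step uses alpha_g = alpha_t / D with D = w / 2,
# so that alpha_g stays within [-1, 1] for shifts up to half the image width.
```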

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
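The zoom edit can be sketched as follows, using Pillow purely for resizing (an assumption of this example, not necessarily the authors' choice of library):

```python
import numpy as np
from PIL import Image

def zoom_edit(img, alpha_t):
    """Zoom target: alpha_t < 1 zooms in (central crop resized back up),
    alpha_t > 1 zooms out (downsample pasted onto an empty canvas, with a mask)."""
    h, w, _ = img.shape
    pil = Image.fromarray((img * 255).astype(np.uint8))
    mask = np.ones((h, w, 1), dtype=img.dtype)
    if alpha_t < 1:                       # zoom in: crop the central alpha_t fraction
        ch, cw = int(h * alpha_t), int(w * alpha_t)
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = pil.crop((left, top, left + cw, top + ch)).resize((w, h), Image.BILINEAR)
        out = np.asarray(crop, dtype=img.dtype) / 255.0
    else:                                 # zoom out: shrink by 1/alpha_t, inpaint the rest
        sh, sw = int(h / alpha_t), int(w / alpha_t)
        small = np.asarray(pil.resize((sw, sh), Image.BILINEAR), dtype=img.dtype) / 255.0
        out = np.zeros_like(img)
        mask = np.zeros((h, w, 1), dtype=img.dtype)
        top, left = (h - sh) // 2, (w - sw) // 2
        out[top:top + sh, left:left + sw] = small
        mask[top:top + sh, left:left + sw] = 1.0
    return out, mask

# Latent step size: alpha_g = log(alpha_t), making zoom-in and zoom-out symmetric around 0.
```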

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g·w = z + α_R·w_R + α_G·w_G + α_B·w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.
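As a small illustration of this parameterization (the function names and the clipping to [0, 1] are assumptions of the sketch):

```python
import numpy as np

def color_edit(img, alpha_t):
    """Target edit: add per-channel offsets alpha_t = (aR, aG, aB), each in [-0.5, 0.5]."""
    return np.clip(img + np.asarray(alpha_t).reshape(1, 1, 3), 0.0, 1.0)

def color_walk_step(z, w_rgb, alpha_t):
    """Latent step z + aR*w_R + aG*w_G + aB*w_B for jointly learned directions
    w_rgb of shape (3, z_dim)."""
    return z + np.asarray(alpha_t) @ w_rgb
```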

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degrees of rotation. Using R = 45, we scale α_g = α_t/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degrees of rotation, we scale α_g = α_t/R where R = 45, and we apply a mask during training.
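The rotation edits can be approximated with OpenCV warps, as sketched below; the billboard-style perspective mapping and its foreshortening heuristic are simplifications of this example, not the authors' exact warp.

```python
import cv2
import numpy as np

def rotate2d_edit(img, alpha_t):
    """In-plane rotation of the target by alpha_t degrees (|alpha_t| <= 45).
    img: H x W x 3 float32 array. Returns the rotated image and a validity mask."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), alpha_t, 1.0)
    rotated = cv2.warpAffine(img, M, (w, h))
    mask = cv2.warpAffine(np.ones((h, w), np.float32), M, (w, h))
    return rotated, mask[..., None]

def rotate3d_edit(img, alpha_t, depth=0.3):
    """Billboard-style perspective warp standing in for a 3D rotation by alpha_t degrees:
    one vertical edge recedes while the other stays fixed. 'depth' controls the amount of
    foreshortening and is a heuristic of this sketch."""
    h, w = img.shape[:2]
    d = depth * h * np.sin(np.deg2rad(abs(alpha_t)))
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    if alpha_t >= 0:
        dst = np.float32([[0, 0], [w, d], [w, h - d], [0, h]])
    else:
        dst = np.float32([[0, d], [w, 0], [w, h], [0, h - d]])
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, M, (w, h))
    mask = cv2.warpPerspective(np.ones((h, w), np.float32), M, (w, h))
    return warped, mask[..., None]

# Latent step scaling in both cases: alpha_g = alpha_t / 45.
```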

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z) and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that, for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z1 = z + αF(z)

z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z1 + αF(z1).    (4)

Therefore,

z + 2αF(z) = z + αF(z) + αF(z1)
⇒ αF(z) = αF(z1)
⇒ F(z) = F(z1).

Thus F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation rather than a variable α step, and we then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
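A schematic PyTorch-style sketch of this fixed-step walk is given below; the hidden width and the renormalization back to the input norm are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class EpsilonWalk(nn.Module):
    """Fixed epsilon-step nonlinear walk: each call moves z by a learned step z + NN(z),
    then renormalizes the result to the magnitude of the incoming z."""
    def __init__(self, z_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, z):
        z_next = z + self.net(z)                             # z + NN(z)
        scale = z.norm(dim=1, keepdim=True) / z_next.norm(dim=1, keepdim=True).clamp_min(1e-8)
        return z_next * scale                                # renormalize the magnitude

# Larger transformations are reached by applying the walk repeatedly (Euler-style steps);
# a second network with identical architecture handles the negative direction.
walk_pos = EpsilonWalk(z_dim=128)
z = torch.randn(4, 128)
for _ in range(4):   # roughly 4-5 steps cover the range of the linear trajectory
    z = walk_pos(z)
```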

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For the shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotation in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see Sec. B.3).
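For reference, consecutive-step perceptual distances of this kind can be computed with the publicly available lpips package of [25]; the batching and value range below are assumptions of this example.

```python
import torch
import lpips  # pip install lpips; perceptual metric of Zhang et al. [25]

loss_fn = lpips.LPIPS(net='alex')

def consecutive_lpips(images_per_alpha):
    """images_per_alpha: list of N x 3 x H x W tensors in [-1, 1], one batch per alpha_i
    along the walk. Returns the mean LPIPS distance between each consecutive pair."""
    dists = []
    with torch.no_grad():
        for a, b in zip(images_per_alpha[:-1], images_per_alpha[1:]):
            dists.append(loss_fn(a, b).mean().item())
    return dists
```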

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model.² In Figs. 11 and 12 we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
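A sketch of extracting the bounding-box statistics with OpenCV's DNN module is shown below; the model file names are placeholders for the frozen SSD graph referenced in footnote 2, and the post-processing choices are assumptions of this example.

```python
import cv2
import numpy as np

# Placeholder file names for the frozen MobileNet-SSD graph and its config (see footnote 2).
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb", "ssd_mobilenet.pbtxt")

def detect_box_stats(img_bgr, conf_thresh=0.5):
    """Returns (area, center_x, center_y) of the most confident detection, all as
    fractions of the image size, or None if nothing exceeds the confidence threshold."""
    net.setInput(cv2.dnn.blobFromImage(img_bgr, size=(300, 300), swapRB=True))
    det = net.forward()[0, 0]                 # rows: [_, class_id, conf, x1, y1, x2, y2]
    det = det[det[:, 2] > conf_thresh]
    if len(det) == 0:
        return None
    _, _, _, x1, y1, x2, y2 = det[np.argmax(det[:, 2])]
    return (x2 - x1) * (y2 - y1), (x1 + x2) / 2.0, (y1 + y2) / 2.0
```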

B.4. Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z [L(G(z + αw), edit(G(z), α))]    (5)

using a Euclidean loss for L. In Fig. 13 we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2 Non-linear Walks

As described in Sec. A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14, and perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN with linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for the color, shift, and zoom transformations in Figs. 19, 21, and 23, and the corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining the zoom, shift X, and shift Y transformations (Fig. 26).

[Figure 9 panels: PDFs of Area, Center Y, Center X, and Pixel Intensity for dataset versus model samples, corresponding to the Zoom, Shift Y, Shift X, and Luminance statistics.]

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset.

[Figure 10 panels: LPIPS perceptual distance versus α (log(α) for zoom) for the Luminance, Rotate 2D, Shift X, Shift Y, Zoom, and Rotate 3D transformations.]

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.

Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes using model-generated images under zoom, horizontal shift, and vertical shift transformations with random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

[Figure 15 panels, one column per transformation (Luminance, Shift X, Shift Y, Zoom): intersection with G(z) versus α, scatterplots of model ∆µ versus dataset σ (per-column correlation values 0.49, 0.49, 0.55, 0.60, matching Table 1), attribute PDFs under varying α, and FID versus α.]

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.

[Figure 16 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.

[Figure 17 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.

[Figure 18 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.

[Figure 19 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.

[Figure 20 panels, one column per transformation (Luminance, Shift X, Shift Y, Zoom): dataset versus model attribute PDFs, intersection with G(z) versus α, attribute PDFs under varying α, and FID versus α.]

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.

[Figure 21 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.

[Figure 22 panels, one column per transformation (Luminance, Shift X, Shift Y, Zoom): FID versus α, dataset versus model attribute PDFs, intersection with G(z) versus α, and attribute PDFs under varying α.]

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.

[Figure 23 panels: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.

[Figure 24 panels, one column per transformation (Luminance, Shift X, Shift Y, Zoom): dataset versus model attribute PDFs, intersection with G(z) versus α, attribute PDFs under varying α, and FID versus α.]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.

[Figure 25 panels: - Color +, - Contrast +.]

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.

Page 6: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

10 05 00 05 10crarr

00

05

10

Inte

rsec

tion

200 100 0 100 200crarr

00

05

10

Inte

rsec

tion

200 100 0 100 200crarr

00

05

10

Inte

rsec

tion

2 0 2log(crarr)

00

05

10

Inte

rsec

tion

00 02 04 06 08 10Pixel Intensity

000

025

050

075

100

125

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

20

25

30

PD

F

105

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

10 05 00 05 10crarr

20

40

FID

200 100 0 100 200crarr

20

40

FID

200 100 0 100 200crarr

20

40

FID

2 0 2log(crarr)

20

40

FID

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Pixel Intensity

000

025

050

075

100

125

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

20

25

30

PD

F

105

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

Figure 5 Quantifying the extent of transformations We compare the attributes of generated images under the raw modeloutput G(z) compared to the distribution under a learned transformation model(α) We measure the intersection betweenG(z) and model(α) and also compute the FID on the transformed image to limit our transformations to the natural imagemanifold

Using the shift vector we show that we are able to movethe distribution of the center object by changing α In theunderlying model the center coordinate of the object ismost concentrated at half of the image width and height butafter applying the shift in X and shift in Y transformationthe mode of the transformed distribution varies between 03and 07 of the image widthheight To quantify the extentto which the distribution changes we compute the area ofintersection between the original model distribution and thedistribution after applying the transformation and we ob-serve that the intersection decreases as we increase or de-crease the magnitude of α However our transformationsare limited to a certain extent ndash if we increase α beyond 150pixels for vertical shifts we start to generate unrealistic im-ages as evidenced by a sharp rise in FID and convergingmodes in the transformed distributions (Fig 5 columns 2 amp3)

We perform a similar procedure for the zoom operationby measuring the area of the bounding box for the detectedobject under different magnitudes of α Like shift we ob-serve that subsequent increases in α magnitude start to havesmaller and smaller effects on the mode of the resulting dis-tribution (Fig 5 last column) Past an 8x zoom in or outwe observe an increase in the FID of the transformed out-put signifying decreasing image quality Interestingly for

zoom the FID under zooming in and zooming out is anti-symmetric indicating that the success to which we are ableto achieve the zoom-in operation under the constraint of re-alistic looking images differs from that of zooming out

These trends are consistent with the plateau in trans-formation behavior that we qualitatively observe in Fig 3Although we can arbitrarily increase the α step size aftersome point we are unable to achieve further transformationand risk deviating from the natural image manifold

42 How does the data affect the transformations

How do the statistics of the data distribution affect theability to achieve these transformations In other words isthe extent to which we can transform each class as we ob-served in Fig 4 due to limited variability in the underlyingdataset for each class

One intuitive way of quantifying the amount of transfor-mation achieved dependent on the variability in the datasetis to measure the difference in transformed model meansmodel(+α) and model(-α) and compare it to thespread of the dataset distribution For each class we com-pute standard deviation of the dataset with respect to ourstatistic of interest (pixel RGB value for color and bound-ing box area and center value for zoom and shift transfor-mations respectively) We hypothesize that if the amount of

000 025 050 075 100

Area

0

2

4

P(A

)

105

000 025 050 075 100

Area

0

2

4

P(A

)

105

020 025 030

Data

00

01

02

03

Model

micro

Zoom

r = 037 p lt 0001

005 010 015

Data

00

01

02

03

Model

micro

Shift Y

r = 039 p lt 0001

010 015 020

Data

00

01

02

03

Model

micro

Shift X

r = 028 p lt 0001

020 025 030

Data

00

01

02

03

Model

micro

Luminance

r = 059 p lt 0001

- Zoom +

Rob

inLa

ptop

Figure 6 Understanding per-class biases We observe a correlation between the variability in the training data for ImageNetclasses and our ability to shift the distribution under latent space transformations Classes with low variability (eg robin)limit our ability to achieve desired transformations in comparison to classes with a broad dataset distribution (eg laptop)

transformation is biased depending on the image class wewill observe a correlation between the distance of the meanshifts and the standard deviation of the data distribution

More concretely we define the change in model meansunder a given transformation as

∆microk =∥∥microkmodel(+αlowast) minus microkmodel(-αlowast)

∥∥2 (3)

for a given class k under the optimal walk for each trans-formation wlowast under Eq 5 and we set αlowast to be largest andsmallest α values used in training We note that the degreeto which we achieve each transformation is a function of αso we use the same value of α across all classes ndash one thatis large enough to separate the means of microkmodel(αlowast) andmicrokmodel(-αlowast) under transformation but also for which theFID of the generated distribution remains below a thresholdT of generating reasonably realistic images (for our experi-ments we use T = 22)

We apply this measure in Fig 6 where on the x-axis weplot the standard deviation σ of the dataset and on the y-axis we show the model ∆micro under a +αlowast andminusαlowast transfor-mation as defined in Eq 3 We sample randomly from 100classes for the color zoom and shift transformations andgenerate 200 samples of each class under the positive andnegative transformations We use the same setup of draw-ing samples from the model and dataset and computing the

statistics for each transformation as described in Sec 41Indeed we find that the width of the dataset distribu-

tion captured by the standard deviation of random samplesdrawn from the dataset for each class is related to the extentto which we achieve each transformation For these trans-formations we observe a positive correlation between thespread of the dataset and the magnitude of ∆micro observedin the transformed model distributions Furthermore theslope of all observed trends differs significantly from zero(p lt 0001 for all transformations)

For the zoom transformation we show examples of twoextremes along the trend We show an example of theldquorobinrdquo class in which the spread σ in the dataset is low andsubsequently the separation ∆micro that we are able to achieveby applying +αlowast and minusαlowast transformations is limited Onthe other hand in the ldquolaptoprdquo class the underlying spreadin the dataset is broad ImageNet contains images of laptopsof various sizes and we are able to attain a wider shift in themodel distribution when taking a αlowast-sized step in the latentspace

From these results we conclude that the extent to whichwe achieve each transformation is related to the variabil-ity that inherently exists in the dataset Consistent with ourqualitative observations in Fig 4 we find that if the im-ages for a particular class have adequate coverage over the

- Zoom + - Zoom +2x 4x

Linear Lpips

8x1x05x025x0125x 2x 4x 8x1x05x025x0125x

Linear L2 Non-linear L2

Non-linear Lpips

Figure 7 Comparison of linear and nonlinear walks for the zoom operation The linear walk undershoots the targeted levelof transformation but maintains more realistic output

entire range of a given transformation then we are betterable to move the model distribution to both extremes Onthe other hand if the images for the given class exhibit lessvariability the extent of our transformation is limited by thisproperty of the dataset

43 Alternative architectures and walks

We ran an identical set of experiments using the nonlin-ear walk in the BigGAN latent space and obtain similarquantitative results To summarize the Pearsonrsquos correla-tion coefficient between dataset σ and model ∆micro for linearwalks and nonlinear walks is shown in Table 1 and fullresults in B42 Qualitatively we observe that while thelinear trajectory undershoots the targeted level of transfor-mation it is able to preserve more realistic-looking results(Fig 7) The transformations involve a trade-off betweenminimizing the loss and maintaining realistic output andwe hypothesize that the linear walk functions as an implicitregularizer that corresponds well with the inherent organi-zation of the latent space

Luminance Shift X Shift Y ZoomLinear 059 028 039 037

Non-linear 049 049 055 060

Table 1 Pearsonrsquos correlation coefficient between datasetσ and model ∆micro for measured attributes p-value for slopelt 0001 for all transformations

To test the generality of our findings across model archi-tecture we ran similar experiments on StyleGAN in whichthe latent space is divided into two spaces z andW As [13]notes that theW space is less entangled than z we apply thelinear walk to W and show results in Fig 8 and B5 Oneinteresting aspect of StyleGAN is that we can change colorwhile leaving other structure in the image unchanged Inother words while green faces do not naturally exist in thedataset the StyleGAN model is still able to generate themThis differs from the behavior of BigGAN where changing

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

- Luminance +

- Red +

- Green +

- Blue +Figure 8 Distribution for luminance transformation learnedfrom the StyleGAN cars generator and qualitative exam-ples of color transformations on various datasets usingStyleGAN

color results in different semantics in the image eg turn-ing a dormant volcano to an active one or a daytime sceneto a nighttime one StyleGAN however does not preservethe exact geometry of objects under other transformationseg zoom and shift (see Sec B5)

5 Conclusion

GANs are powerful generative models but are they sim-ply replicating the existing training datapoints or are theyable to generalize outside of the training distribution Weinvestigate this question by exploring walks in the latentspace of GANs We optimize trajectories in latent space toreflect simple image transformations in the generated out-put learned in a self-supervised manner We find that themodel is able to exhibit characteristics of extrapolation - weare able to ldquosteerrdquo the generated output to simulate camerazoom horizontal and vertical movement camera rotationsand recolorization However our ability to shift the distri-bution is finite we can transform images to some degree butcannot extrapolate entirely outside the support of the train-ing data Because training datasets are biased in capturingthe most basic properties of the natural visual world trans-formations in the latent space of generative models can onlydo so much

AcknowledgementsWe would like to thank Quang H Le Lore Goetschalckx

Alex Andonian David Bau and Jonas Wulff for helpful dis-cussions This work was supported by a Google FacultyResearch Award to PI and a US National Science Foun-dation Graduate Research Fellowship to LC

References[1] A Amini A Soleimany W Schwarting S Bhatia and

D Rus Uncovering and mitigating algorithmic bias throughlearned latent structure 2

[2] A Azulay and Y Weiss Why do deep convolutional net-works generalize so poorly to small image transformationsarXiv preprint arXiv180512177 2018 2

[3] D Bau J-Y Zhu H Strobelt B Zhou J B TenenbaumW T Freeman and A Torralba Gan dissection Visualizingand understanding generative adversarial networks arXivpreprint arXiv181110597 2018 3

[4] A Brock J Donahue and K Simonyan Large scale gantraining for high fidelity natural image synthesis arXivpreprint arXiv180911096 2018 1 3 4

[5] T S Cohen M Weiler B Kicanaoglu and M WellingGauge equivariant convolutional networks and the icosahe-dral cnn arXiv preprint arXiv190204615 2019 3

[6] E Denton B Hutchinson M Mitchell and T Gebru De-tecting bias with generative counterfactual face attribute aug-mentation arXiv preprint arXiv190606439 2019 3

[7] W T Freeman and E H Adelson The design and use ofsteerable filters IEEE Transactions on Pattern Analysis ampMachine Intelligence (9)891ndash906 1991 2

[8] R Geirhos P Rubisch C Michaelis M Bethge F A Wich-mann and W Brendel Imagenet-trained cnns are biased to-wards texture increasing shape bias improves accuracy androbustness arXiv preprint arXiv181112231 2018 2

[9] L Goetschalckx A Andonian A Oliva and P Isola Gan-alyze Toward visual definitions of cognitive image proper-ties arXiv preprint arXiv190610112 2019 3

[10] I Goodfellow J Pouget-Abadie M Mirza B XuD Warde-Farley S Ozair A Courville and Y Bengio Gen-erative adversarial nets In Advances in neural informationprocessing systems pages 2672ndash2680 2014 1 3

[11] G E Hinton A Krizhevsky and S D Wang Transform-ing auto-encoders In International Conference on ArtificialNeural Networks pages 44ndash51 Springer 2011 3

[12] A Jahanian S Vishwanathan and J P Allebach Learn-ing visual balance from large-scale datasets of aestheticallyhighly rated images In Human Vision and Electronic Imag-ing XX volume 9394 page 93940Y International Society forOptics and Photonics 2015 2

[13] T Karras S Laine and T Aila A style-based genera-tor architecture for generative adversarial networks arXivpreprint arXiv181204948 2018 1 2 3 4 8 12

[14] D E King Dlib-ml A machine learning toolkit Journal ofMachine Learning Research 101755ndash1758 2009 24

[15] D P Kingma and J Ba Adam A method for stochasticoptimization arXiv preprint arXiv14126980 2014 10

[16] D P Kingma and P Dhariwal Glow Generative flow withinvertible 1x1 convolutions In Advances in Neural Informa-tion Processing Systems pages 10236ndash10245 2018 2

[17] K Lenc and A Vedaldi Understanding image representa-tions by measuring their equivariance and equivalence InProceedings of the IEEE conference on computer vision andpattern recognition pages 991ndash999 2015 3

[18] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A C Berg Ssd Single shot multibox detectorIn European conference on computer vision pages 21ndash37Springer 2016 4

[19] E Mezuman and Y Weiss Learning about canonical viewsfrom internet image collections In Advances in neural infor-mation processing systems pages 719ndash727 2012 2

[20] A Radford L Metz and S Chintala Unsupervised repre-sentation learning with deep convolutional generative adver-sarial networks arXiv preprint arXiv151106434 2015 2

[21] Y Shen J Gu J-t Huang X Tang and B Zhou Interpret-ing the latent space of gans for semantic face editing arXivpreprint 2019 3 12

[22] J Simon Ganbreeder httphttpsganbreederapp accessed 2019-03-22 3

[23] A Torralba and A A Efros Unbiased look at dataset bias2011 2

[24] P Upchurch J Gardner G Pleiss R Pless N SnavelyK Bala and K Weinberger Deep feature interpolation forimage content changes In Proceedings of the IEEE con-ference on computer vision and pattern recognition pages7064ndash7073 2017 2

[25] R Zhang P Isola A A Efros E Shechtman and O WangThe unreasonable effectiveness of deep features as a percep-tual metric In CVPR 2018 3 11

[26] J-Y Zhu P Krahenbuhl E Shechtman and A A EfrosGenerative visual manipulation on the natural image mani-fold In European Conference on Computer Vision pages597ndash613 Springer 2016 3

Appendix

A Method detailsA1 Optimization for the linear walk

We learn the walk vector using mini-batch stochasticgradient descent with the Adam optimizer [15] in tensor-flow trained on 20000 unique samples from the latent spacez We share the vector w across all ImageNet categories forthe BigGAN model

A2 Implementation Details for Linear Walk

We experiment with a number of different transforma-tions learned in the latent space each corresponding to adifferent walk vector Each of these transformations can belearned without any direct supervision simply by applyingour desired edit to the source image Furthermore the pa-rameter α allows us to vary the extent of the transformationWe found that a slight modification to each transformationimproved the degree to which we were able to steer the out-put space we scale α differently for the learned transfor-mation G(z + αgw) and the target edit edit(G(z) αt)We detail each transformation below

Shift We learn transformations corresponding to shiftingan image in the horizontal X direction and the vertical Ydirection We train on source images that are shifted minusαtpixels to the left and αt pixels to the right where we set αtto be between zero and one-half of the source image widthor heightD When training the walk we enforce that the αgparameter ranges between -1 and 1 thus for a random shiftby t pixels we use the value αg = αtD We apply a maskto the shifted image so that we only apply the loss functionon the visible portion of the source image This forces thegenerator to extrapolate on the obscured region of the targetimage

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
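A corresponding zoom target could look roughly like this (illustrative sketch; the center-crop and zero-padding conventions, and the use of OpenCV resizing, are our assumptions):

import cv2
import numpy as np

def edit_zoom(img, alpha_t):
    # alpha_t < 1: zoom in by cropping the central alpha_t fraction and resizing back up.
    # alpha_t > 1: zoom out by downsampling by 1/alpha_t and zero-padding; the mask marks
    # the region on which the loss is applied, so the generator inpaints the background.
    h, w = img.shape[:2]
    if alpha_t < 1:
        ch, cw = int(h * alpha_t), int(w * alpha_t)
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        out = cv2.resize(img[y0:y0 + ch, x0:x0 + cw], (w, h))
        mask = np.ones((h, w), dtype=np.float32)
    else:
        sh, sw = int(h / alpha_t), int(w / alpha_t)
        small = cv2.resize(img, (sw, sh))
        out = np.zeros_like(img)
        mask = np.zeros((h, w), dtype=np.float32)
        y0, x0 = (h - sh) // 2, (w - sw) // 2
        out[y0:y0 + sh, x0:x0 + sw] = small
        mask[y0:y0 + sh, x0:x0 + sw] = 1.0
    return out, mask

alpha_t = 0.5
alpha_g = np.log(alpha_t)   # crop (alpha_t < 1) and downsample (alpha_t > 1) map to opposite signs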

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg w = z + αR wR + αG wG + αB wB, where we jointly learn the three walk directions wR, wG, and wB.
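In code, the color edit and the corresponding three-direction walk are only a few lines (sketch; tensor shapes are assumptions):

import torch

def edit_color(img, alpha_rgb):
    # Add per-channel offsets (alpha_R, alpha_G, alpha_B) to an NCHW image tensor.
    return img + alpha_rgb.view(1, 3, 1, 1)

def color_walk(z, w_rgb, alpha_rgb):
    # w_rgb is a (3, z_dim) tensor holding the jointly learned directions w_R, w_G, w_B;
    # the walk is z + alpha_R*w_R + alpha_G*w_G + alpha_B*w_B.
    return z + alpha_rgb @ w_rgb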

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotations. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotations, we scale αg = αt/R where R = 45, and apply a mask during training.

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z [L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target starting from an initial z, we can take two steps of size α in latent space as follows:

z1 = z + αF(z)
z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z1 + αF(z1)    (4)

Therefore

z + 2αF(z) = z + αF(z) + αF(z1) ⇒ αF(z) = αF(z1) ⇒ F(z) = F(z1).

Thus, F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate zi+1 = zi + ε (dz/dt)|zi. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
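A sketch of this parameterization and of composing several ε-steps is shown below (the hidden width and the renormalization to the initial latent norm are our assumptions; as noted above, separate networks of this form handle the positive and negative directions):

import torch
import torch.nn as nn

class EpsStep(nn.Module):
    # F(z) = z + NN(z), where NN is a small two-layer network learning a fixed eps-step.
    def __init__(self, z_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, z):
        return z + self.net(z)

def walk_nonlinear(z, step_fn, n_steps):
    # Compose n_steps of the learned eps-step (the Euler-style update z_{i+1} = z_i + NN(z_i)),
    # renormalizing z to its original magnitude after each step.
    norm = z.norm(dim=1, keepdim=True)
    for _ in range(n_steps):
        z = step_fn(z)
        z = z / z.norm(dim=1, keepdim=True) * norm
    return z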

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
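These consecutive-α perceptual distances (plotted in Fig. 10) can be computed with the publicly released lpips package, roughly as follows (sketch; it assumes the generated images are NCHW tensors scaled to [−1, 1], and that G, the learned vector w, and the α grid are given):

import lpips
import torch

loss_fn = lpips.LPIPS(net='alex')   # perceptual metric of [25]

def consecutive_distances(G, z, w, alphas):
    # LPIPS distance between images generated at consecutive steps alpha_i and alpha_{i+1}.
    with torch.no_grad():
        imgs = [G(z + a * w) for a in alphas]
        return [loss_fn(imgs[i], imgs[i + 1]).mean().item() for i in range(len(imgs) - 1)]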

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations, for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
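A minimal way to run such a detector over generated images is OpenCV's DNN module, e.g. (sketch; the file names are placeholders for the exported MobileNet-SSD graph and its config, and the 0.5 confidence threshold is our choice):

import cv2

# Placeholder paths to a frozen MobileNet-SSD inference graph and a matching config file.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb", "ssd_mobilenet_v1.pbtxt")

def detect_boxes(img_bgr, conf_thresh=0.5):
    # Return [x1, y1, x2, y2] pixel-coordinate boxes for detections above conf_thresh.
    h, w = img_bgr.shape[:2]
    net.setInput(cv2.dnn.blobFromImage(img_bgr, size=(300, 300), swapRB=True))
    detections = net.forward()   # shape (1, 1, N, 7): [_, class_id, conf, x1, y1, x2, y2]
    return [[d[3] * w, d[4] * h, d[5] * w, d[6] * h]
            for d in detections[0, 0] if d[2] > conf_thresh]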

B.4. Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z [L(G(z + αw), edit(G(z), α))],    (5)

using a Euclidean loss for L. In Fig. 13 we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2 Non-linear Walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, 23, and corresponding quantitative analyses in Figs. 20, 22, 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
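Only the target construction changes for the segmented variant: the color offset is applied inside the segmentation mask before computing the usual loss (sketch; seg_mask is assumed to be a per-pixel car mask from an off-the-shelf segmenter, and the walk parameterization follows the color walk in A.2):

import torch

def masked_color_target(src, alpha_rgb, seg_mask):
    # Apply the per-channel color offset only inside seg_mask (e.g., the segmented car region);
    # the walked output G(z + alpha_rgb @ w_rgb) is then regressed against this target as usual.
    return src + alpha_rgb.view(1, 3, 1, 1) * seg_mask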

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1, for the Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, and Zoom transformations. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.

Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes using model-generated images for zoom and horizontal and vertical shift transformations under random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

Figure 15: Quantitative experiments for nonlinear walks in BigGAN (Luminance, Shift X, Shift Y, and Zoom). We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.

Figure 16: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D transformations).

Figure 17: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D transformations).

Figure 18: Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D transformations).

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D).

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator (Luminance, Shift X, Shift Y, and Zoom).

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D).

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator (Luminance, Shift X, Shift Y, and Zoom).

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator (color, zoom, shift X, shift Y, rotate 2D, and rotate 3D).

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator (Luminance, Shift X, Shift Y, and Zoom). For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in the left column, and a contrast walk for both BigGAN and StyleGAN in the right column.

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.



Page 8: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

- Zoom + - Zoom +2x 4x

Linear Lpips

8x1x05x025x0125x 2x 4x 8x1x05x025x0125x

Linear L2 Non-linear L2

Non-linear Lpips

Figure 7 Comparison of linear and nonlinear walks for the zoom operation The linear walk undershoots the targeted levelof transformation but maintains more realistic output

entire range of a given transformation then we are betterable to move the model distribution to both extremes Onthe other hand if the images for the given class exhibit lessvariability the extent of our transformation is limited by thisproperty of the dataset

43 Alternative architectures and walks

We ran an identical set of experiments using the nonlin-ear walk in the BigGAN latent space and obtain similarquantitative results To summarize the Pearsonrsquos correla-tion coefficient between dataset σ and model ∆micro for linearwalks and nonlinear walks is shown in Table 1 and fullresults in B42 Qualitatively we observe that while thelinear trajectory undershoots the targeted level of transfor-mation it is able to preserve more realistic-looking results(Fig 7) The transformations involve a trade-off betweenminimizing the loss and maintaining realistic output andwe hypothesize that the linear walk functions as an implicitregularizer that corresponds well with the inherent organi-zation of the latent space

             Luminance   Shift X   Shift Y   Zoom
Linear          0.59       0.28      0.39     0.37
Non-linear      0.49       0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
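As a rough illustration of how a number like those in Table 1 is computed, the sketch below runs SciPy's Pearson correlation on synthetic per-class statistics; the two arrays are hypothetical stand-ins for the real measurements of dataset σ and model ∆µ, not the paper's data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Stand-ins for per-class measurements (hypothetical values):
dataset_sigma = rng.uniform(0.05, 0.30, size=100)                        # spread of an attribute in the data
model_delta_mu = 1.5 * dataset_sigma + rng.normal(0.0, 0.03, size=100)   # attribute shift under the walk

r, p = pearsonr(dataset_sigma, model_delta_mu)
print(f"Pearson r = {r:.2f}, p-value = {p:.2g}")
```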

To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. Since [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them.

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations (luminance, red, green, blue) on various datasets using StyleGAN.

This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.

Acknowledgements
We would like to thank Quang H Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.

References
[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
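A minimal sketch of this optimization is shown below in PyTorch (the paper's implementation is in TensorFlow); `G` and `edit` are stand-in callables for the generator and the pixel-space edit, and the hyperparameters are illustrative rather than the paper's exact settings.

```python
import torch

def train_linear_walk(G, edit, z_dim=128, steps=20000, batch=4, lr=1e-3, alpha_max=1.0, device="cpu"):
    """Learn a single walk vector w shared across categories (sketch, not the exact implementation)."""
    w = torch.zeros(1, z_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        z = torch.randn(batch, z_dim, device=device)
        alpha = (2 * torch.rand(batch, 1, device=device) - 1) * alpha_max  # random step sizes in [-1, 1]
        with torch.no_grad():
            target = edit(G(z), alpha)          # apply the desired edit to the source image
        out = G(z + alpha * w)                  # take the same-sized step in latent space
        loss = torch.mean((out - target) ** 2)  # L2 objective; LPIPS is a drop-in alternative (Sec. B.4.1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```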

A.2. Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg w) and the target edit edit(G(z), αt). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
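The sketch below illustrates one way to build the shift-X target and its mask (images assumed to be [N, C, H, W] tensors); it is an illustration of the description above, not the paper's exact code.

```python
import torch

def shift_x_edit(img, t):
    """Shift img by t pixels along x; return the edited image, a visibility mask, and alpha_g."""
    n, c, h, w = img.shape
    shifted = torch.zeros_like(img)
    mask = torch.zeros_like(img)
    t = int(t)
    if t >= 0:
        shifted[..., t:] = img[..., :w - t]
        mask[..., t:] = 1.0              # loss is applied only where source pixels remain visible
    else:
        shifted[..., :w + t] = img[..., -t:]
        mask[..., :w + t] = 1.0
    alpha_g = t / w                      # scale the latent step by the image size (alpha_g = alpha_t / D)
    return shifted, mask, alpha_g
```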

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and a negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
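A sketch of the zoom edit under the same tensor assumptions (the mask for zoom-out follows the shift example and is omitted here):

```python
import math
import torch
import torch.nn.functional as F

def zoom_edit(img, alpha_t):
    """Zoom target: alpha_t < 1 zooms in (crop + upsample), alpha_t > 1 zooms out (downsample onto a canvas)."""
    n, c, h, w = img.shape
    if alpha_t < 1:
        ch, cw = max(1, int(h * alpha_t)), max(1, int(w * alpha_t))
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = img[..., top:top + ch, left:left + cw]
        out = F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)
    else:
        ch, cw = max(1, int(h / alpha_t)), max(1, int(w / alpha_t))
        small = F.interpolate(img, size=(ch, cw), mode="bilinear", align_corners=False)
        out = torch.zeros_like(img)
        top, left = (h - ch) // 2, (w - cw) // 2
        out[..., top:top + ch, left:left + cw] = small
    alpha_g = math.log(alpha_t)          # symmetric latent step for zooming in and out
    return out, alpha_g
```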

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg w = z + αR wR + αG wG + αB wB, where we jointly learn the three walk directions wR, wG, and wB.
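A small sketch of the color edit and the corresponding three-direction latent step, under the same tensor assumptions:

```python
import torch

def color_edit(img, alpha_t):
    """alpha_t = (aR, aG, aB), each in [-0.5, 0.5]; add a constant offset per channel."""
    offset = torch.tensor(alpha_t, dtype=img.dtype, device=img.device).view(1, 3, 1, 1)
    return img + offset

def color_latent_step(z, alpha_t, wR, wG, wB):
    """z + aR*wR + aG*wG + aB*wB, with the three walk directions learned jointly."""
    aR, aG, aB = alpha_t
    return z + aR * wR + aG * wG + aB * wB
```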

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degrees of rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degrees of rotation, scale αg = αt/R where R = 45, and apply a mask during training.
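A sketch of the 2D-rotation edit using OpenCV (the 3D case would instead build a perspective matrix and use cv2.warpPerspective); `img` is assumed to be an HxWx3 uint8 array here.

```python
import cv2
import numpy as np

def rotate2d_edit(img, alpha_t, R=45.0):
    """Rotate by alpha_t degrees (|alpha_t| <= R); return the edit, a visibility mask, and alpha_g."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), alpha_t, 1.0)
    out = cv2.warpAffine(img, M, (w, h))
    mask = cv2.warpAffine(np.ones((h, w), dtype=np.uint8), M, (w, h))  # loss applied only where mask == 1
    return out, mask, alpha_t / R
```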

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize J(w, α) = Ez[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z1 = z + αF(z)

z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z1 + αF(z1)    (4)

Therefore

z + 2αF(z) = z + αF(z) + αF(z1)  ⇒  αF(z) = αF(z1)  ⇒  F(z) = F(z1)

Thus, F(·) simply becomes a linear trajectory that is independent of the input z.

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate zi+1 = zi + ε dz/dt|zi. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
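The sketch below illustrates the fixed-ε step and its repeated application (PyTorch; network sizes and the renormalization detail are assumptions, and the paper uses separate networks for the positive and negative directions):

```python
import torch
import torch.nn as nn

class EpsilonStep(nn.Module):
    """One fixed-size step z -> z + NN(z), followed by renormalizing the latent magnitude."""
    def __init__(self, z_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, z):
        z_next = z + self.net(z)
        scale = z.norm(dim=1, keepdim=True) / z_next.norm(dim=1, keepdim=True).clamp(min=1e-8)
        return z_next * scale

def nonlinear_walk(z, step, n_steps):
    """Apply the epsilon step repeatedly (4-5 steps span the linear walk's range)."""
    for _ in range(n_steps):
        z = step(z)
    return z
```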

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
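A sketch of this consecutive-α measurement using the `lpips` package (G, z, and the walk vector w are assumed; LPIPS expects images scaled to [-1, 1]):

```python
import torch
import lpips

def consecutive_lpips(G, z, w, alphas, device="cpu"):
    """Perceptual distance between images generated at consecutive alpha values along a linear walk."""
    loss_fn = lpips.LPIPS(net="alex").to(device)
    with torch.no_grad():
        imgs = [G(z + a * w) for a in alphas]
        return [loss_fn(imgs[i], imgs[i + 1]).mean().item() for i in range(len(imgs) - 1)]
```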

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model². In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4. Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = Ez[L(G(z + αw), edit(G(z), α))]    (5)

using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and a learning rate of 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2 Non-linear Walks

Following the non-linear walk formulation in A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text: we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and the corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target: for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
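A sketch of restricting the color edit to a segmentation mask (the mask is assumed to be a precomputed [N, 1, H, W] car segmentation; whether the loss itself is also masked is an implementation choice here, not stated in the paper):

```python
import torch

def masked_color_target(source, alpha_t, mask):
    """Apply the per-channel color offset only inside the segmented region."""
    offset = torch.tensor(alpha_t, dtype=source.dtype, device=source.device).view(1, 3, 1, 1)
    return source + offset * mask

def masked_l2_loss(generated, target, mask):
    """Optionally restrict the objective to the segmented pixels as well."""
    return ((generated - target) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```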

Figure 9: Comparing model versus dataset distributions (panels: Luminance, Shift X, Shift Y, Zoom). We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1 (panels: Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom). We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.

Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes using model-generated images under zoom, horizontal shift, and vertical shift transformations with random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

Figure 15: Quantitative experiments for nonlinear walks in BigGAN (columns: Luminance, Shift X, Shift Y, Zoom; dataset-model correlations 0.49, 0.49, 0.55, 0.60). We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.

Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and the L2 objective (zoom, shift X, shift Y, color, rotate 2D, and rotate 3D transformations).

Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and the LPIPS objective.

Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator (columns: Luminance, Shift X, Shift Y, Zoom).

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator (columns: Luminance, Shift X, Shift Y, Zoom).

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator (columns: Luminance, Shift X, Shift Y, Zoom). For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.

Page 9: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

5 Conclusion

GANs are powerful generative models but are they sim-ply replicating the existing training datapoints or are theyable to generalize outside of the training distribution Weinvestigate this question by exploring walks in the latentspace of GANs We optimize trajectories in latent space toreflect simple image transformations in the generated out-put learned in a self-supervised manner We find that themodel is able to exhibit characteristics of extrapolation - weare able to ldquosteerrdquo the generated output to simulate camerazoom horizontal and vertical movement camera rotationsand recolorization However our ability to shift the distri-bution is finite we can transform images to some degree butcannot extrapolate entirely outside the support of the train-ing data Because training datasets are biased in capturingthe most basic properties of the natural visual world trans-formations in the latent space of generative models can onlydo so much

AcknowledgementsWe would like to thank Quang H Le Lore Goetschalckx

Alex Andonian David Bau and Jonas Wulff for helpful dis-cussions This work was supported by a Google FacultyResearch Award to PI and a US National Science Foun-dation Graduate Research Fellowship to LC

References[1] A Amini A Soleimany W Schwarting S Bhatia and

D Rus Uncovering and mitigating algorithmic bias throughlearned latent structure 2

[2] A Azulay and Y Weiss Why do deep convolutional net-works generalize so poorly to small image transformationsarXiv preprint arXiv180512177 2018 2

[3] D Bau J-Y Zhu H Strobelt B Zhou J B TenenbaumW T Freeman and A Torralba Gan dissection Visualizingand understanding generative adversarial networks arXivpreprint arXiv181110597 2018 3

[4] A Brock J Donahue and K Simonyan Large scale gantraining for high fidelity natural image synthesis arXivpreprint arXiv180911096 2018 1 3 4

[5] T S Cohen M Weiler B Kicanaoglu and M WellingGauge equivariant convolutional networks and the icosahe-dral cnn arXiv preprint arXiv190204615 2019 3

[6] E Denton B Hutchinson M Mitchell and T Gebru De-tecting bias with generative counterfactual face attribute aug-mentation arXiv preprint arXiv190606439 2019 3

[7] W T Freeman and E H Adelson The design and use ofsteerable filters IEEE Transactions on Pattern Analysis ampMachine Intelligence (9)891ndash906 1991 2

[8] R Geirhos P Rubisch C Michaelis M Bethge F A Wich-mann and W Brendel Imagenet-trained cnns are biased to-wards texture increasing shape bias improves accuracy androbustness arXiv preprint arXiv181112231 2018 2

[9] L Goetschalckx A Andonian A Oliva and P Isola Gan-alyze Toward visual definitions of cognitive image proper-ties arXiv preprint arXiv190610112 2019 3

[10] I Goodfellow J Pouget-Abadie M Mirza B XuD Warde-Farley S Ozair A Courville and Y Bengio Gen-erative adversarial nets In Advances in neural informationprocessing systems pages 2672ndash2680 2014 1 3

[11] G E Hinton A Krizhevsky and S D Wang Transform-ing auto-encoders In International Conference on ArtificialNeural Networks pages 44ndash51 Springer 2011 3

[12] A Jahanian S Vishwanathan and J P Allebach Learn-ing visual balance from large-scale datasets of aestheticallyhighly rated images In Human Vision and Electronic Imag-ing XX volume 9394 page 93940Y International Society forOptics and Photonics 2015 2

[13] T Karras S Laine and T Aila A style-based genera-tor architecture for generative adversarial networks arXivpreprint arXiv181204948 2018 1 2 3 4 8 12

[14] D E King Dlib-ml A machine learning toolkit Journal ofMachine Learning Research 101755ndash1758 2009 24

[15] D P Kingma and J Ba Adam A method for stochasticoptimization arXiv preprint arXiv14126980 2014 10

[16] D P Kingma and P Dhariwal Glow Generative flow withinvertible 1x1 convolutions In Advances in Neural Informa-tion Processing Systems pages 10236ndash10245 2018 2

[17] K Lenc and A Vedaldi Understanding image representa-tions by measuring their equivariance and equivalence InProceedings of the IEEE conference on computer vision andpattern recognition pages 991ndash999 2015 3

[18] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A C Berg Ssd Single shot multibox detectorIn European conference on computer vision pages 21ndash37Springer 2016 4

[19] E Mezuman and Y Weiss Learning about canonical viewsfrom internet image collections In Advances in neural infor-mation processing systems pages 719ndash727 2012 2

[20] A Radford L Metz and S Chintala Unsupervised repre-sentation learning with deep convolutional generative adver-sarial networks arXiv preprint arXiv151106434 2015 2

[21] Y Shen J Gu J-t Huang X Tang and B Zhou Interpret-ing the latent space of gans for semantic face editing arXivpreprint 2019 3 12

[22] J Simon Ganbreeder httphttpsganbreederapp accessed 2019-03-22 3

[23] A Torralba and A A Efros Unbiased look at dataset bias2011 2

[24] P Upchurch J Gardner G Pleiss R Pless N SnavelyK Bala and K Weinberger Deep feature interpolation forimage content changes In Proceedings of the IEEE con-ference on computer vision and pattern recognition pages7064ndash7073 2017 2

[25] R Zhang P Isola A A Efros E Shechtman and O WangThe unreasonable effectiveness of deep features as a percep-tual metric In CVPR 2018 3 11

[26] J-Y Zhu P Krahenbuhl E Shechtman and A A EfrosGenerative visual manipulation on the natural image mani-fold In European Conference on Computer Vision pages597ndash613 Springer 2016 3

Appendix

A Method detailsA1 Optimization for the linear walk

We learn the walk vector using mini-batch stochasticgradient descent with the Adam optimizer [15] in tensor-flow trained on 20000 unique samples from the latent spacez We share the vector w across all ImageNet categories forthe BigGAN model

A2 Implementation Details for Linear Walk

We experiment with a number of different transforma-tions learned in the latent space each corresponding to adifferent walk vector Each of these transformations can belearned without any direct supervision simply by applyingour desired edit to the source image Furthermore the pa-rameter α allows us to vary the extent of the transformationWe found that a slight modification to each transformationimproved the degree to which we were able to steer the out-put space we scale α differently for the learned transfor-mation G(z + αgw) and the target edit edit(G(z) αt)We detail each transformation below

Shift We learn transformations corresponding to shiftingan image in the horizontal X direction and the vertical Ydirection We train on source images that are shifted minusαtpixels to the left and αt pixels to the right where we set αtto be between zero and one-half of the source image widthor heightD When training the walk we enforce that the αgparameter ranges between -1 and 1 thus for a random shiftby t pixels we use the value αg = αtD We apply a maskto the shifted image so that we only apply the loss functionon the visible portion of the source image This forces thegenerator to extrapolate on the obscured region of the targetimage

Zoom We learn a walk which is optimized to zoom inand out up to four times the original image For zoomingin we crop the central portion of the source image by someαt amount where 025 lt αt lt 1 and resize it back to itsoriginal size To zoom out we downsample the image byαt where 1 lt αt lt 4 To allow for both a positive andnegative walk direction we set αg = log(αt) Similar toshift a mask applied during training allows the generator toinpaint the background scene

Color We implement color as a continuous RGB slidereg a 3-tuple αt = (αR αG αB) where each αR αG αBcan take values between [minus05 05] in training To edit thesource image we simply add the corresponding αt valuesto each of the image channels Our latent space walk isparameterized as z+αgw = z+αRwR +αGwG +αBwB

where we jointly learn the three walk directions wR wGand wB

Rotate in 2D Rotation in 2D is trained in a similar man-ner as the shift operations where we train withminus45 le αt le45 degree rotation Using R = 45 scale αg = αtR Weuse a mask to enforce the loss only on visible regions of thetarget

Rotate in 3D We simulate a 3D rotation using a perspec-tive transformation along the Z-axis essentially treating theimage as a rotating billboard Similar to the 2D rotationwe train with minus45 le αt le 45 degree rotation we scaleαg = αtR where R = 45 and apply a mask during train-ing

A3 Linear NN(z) walk

Rather than defining w as a vector in z space (Eq 1) onecould define it as a function that takes a z as input and mapsit to the desired zprime after taking a variable-sized step α in la-tent space In this case we may parametrize the walk witha neural network w = NN(z) and transform the image us-ing G(z+αNN(z)) However as we show in the followingproof this idea will not learn to let w be a function of z

Proof For simplicity let w = F (z) We optimize forJ(wα) = Ez [L(G(z + αw)edit(G(z) α))] where αis an arbitrary scalar value Note that for the target imagetwo equal edit operations is equivalent to performing a sin-gle edit of twice the size (eg shifting by 10px the same asshifting by 5px twice zooming by 4x is the same as zoom-ing by 2x twice) That is

edit(G(z) 2α) = edit(edit(G(z) α) α)

To achieve this target starting from an initial z we can taketwo steps of size α in latent space as follows

z1 = z + αF (z)

z2 = z1 + αF (z1)

However because we let α take on any scalar value duringoptimization our objective function enforces that startingfrom z and taking a step of size 2α equals taking two stepsof size α

z + 2αF (z) = z1 + αF (z1) (4)

Therefore

z + 2αF (z) = z + αF (z) + αF (z1)rArrαF (z) = αF (z1)rArrF (z) = F (z1)

Thus F (middot) simply becomes a linear trajectory that is inde-pendent of the input z

A4 Optimization for the non-linear walk

Given the limitations of the previous walk we define ournonlinear walk F (z) using discrete step sizes ε We defineF (z) as z + NN(z) where the neural network NN learnsa fixed ε step transformation rather than a variable α stepWe then renormalize the magnitude z This approach mim-ics the Euler method for solving ODEs with a discrete stepsize where we assume that the gradient of the transforma-tion in latent space is of the form εdzdt = NN(z) and we ap-proximate zi+1 = zi + εdzdt |zi The key difference from A3is the fixed step size which avoids optimizing for the equal-ity in (4)

We use a two-layer neural network to parametrize thewalk and optimize over 20000 samples using the Adamoptimizer as before Positive and negative transformationdirections are handled with two neural networks havingidentical architecture but independent weights We set ε toachieve the same transformation ranges as the linear trajec-tory within 4-5 steps

B Additional ExperimentsB1 Model and Data Distributions

How well does the model distribution of each propertymatch the dataset distribution If the generated images donot form a good approximation of the dataset variability weexpect that this would also impact our ability to transformgenerated images In Fig 9 we show the attribute distri-butions of the BigGAN model G(z) compared to samplesfrom the ImageNet dataset We show corresponding resultsfor StyleGAN and its respective datasets in Sec B5 Whilethere is some bias in how well model-generated images ap-proximated the dataset distribution we hypothesize that ad-ditional biases in our transformations come from variabilityin the training data

B2 Quantifying Transformation Limits

We observe that when we increase the transformationmagnitude α in latent space the generated images becomeunrealistic and the transformation ceases to have further ef-fect We show this qualitatively in Fig 3 To quantitativelyverify this trends we can compute the LPIPS perceptualdistance of images generated using consecutive pairs of αiand αi+1 For shift and zoom transformations perceptualdistance is larger when α (or log(α) for zoom) is near zeroand decreases as the the magnitude of α increases whichindicates that large α magnitudes have a smaller transfor-mation effect and the transformed images appear more sim-ilar On the other hand color and rotate in 2D3D exhibit asteady transformation rate as the magnitude of α increases

Note that this analysis does not tell us how well weachieve the specific transformation nor whether the latent

trajectory deviates from natural-looking images Rather ittells us how much we manage to change the image regard-less of the transformation target To quantify how well eachtransformation is achieved we rely on attribute detectorssuch as object bounding boxes (see B3)

B3 Detected Bounding Boxes

To quantify the degree to which we are able to achievethe zoom and shift transformations we rely on a pre-trainedMobileNet-SSD v12 object detection model In Fig 11and 12 we show the results of applying the object detectionmodel to images from the dataset and images generatedby the model under the zoom horizontal shift and verti-cal shift transformations for randomly selected values of αto qualitatively verify that the object detection boundariesare reasonable Not all ImageNet images contain recogniz-able objects so we only use ImageNet classes containingobjects recognizable by the detector for this analysis

B4 Alternative Walks in BigGAN

B41 LPIPS objective

In the main text we learn the latent space walk w by mini-mizing the objective function

J(wα) = Ez [L(G(z + αw)edit(G(z) α))] (5)

using a Euclidean loss for L In Fig 13 we show qualitativeresults using the LPIPS perceptual similarity metric [25] in-stead of Euclidean loss Walks were trained using the sameparameters as those in the linear-L2 walk shown in the maintext we use 20k samples for training with Adam optimizerand learning rate 0001 for zoom and color 00001 for theremaining edit operations (due to scaling of α)

B42 Non-linear Walks

Following B42 we modify our objective to use discretestep sizes ε rather than continuous steps We learn a func-tion F (z) to perform this ε-step transformation on given la-tent code z where F (z) is parametrized with a neural net-work We show qualitative results in Fig 14 We performthe same set of experiments shown in the main text usingthis nonlinear walk in Fig 15 These experiments exhibitsimilar trends as we observed in the main text ndash we are ableto modify the generated distribution of images using latentspace walks and the amount to which we can transformis related to the variability in the dataset However thereare greater increases in FID when we apply the non-lineartransformation suggesting that these generated images de-viate more from natural images and look less realistic

2httpsgithubcomopencvopencvwikiTensorFlow-Object-Detection-API

B43 Additional Qualitative Examples

We show qualitative examples for randomly generated cat-egories for BigGAN linear-L2 linear LPIPS and nonlineartrajectories in Figs 16 17 18 respectively

B5 Walks in StyleGAN

We perform similar experiments for linear latent spacewalks using StyleGAN models trained on the LSUN catLSUN car and FFHQ face datasets As suggested by [13]we learn the walk vector in the intermediate W latent spacedue to improved attribute disentanglement in W We showqualitative results for color shift and zoom transformationsin Figs 19 21 23 and corresponding quantitative analyses

in Figs 20 22 24

B6 Qualitative examples for additional transfor-mations

Since the color transformation operates on individualpixels we can optimize the walk using a segmented tar-get ndash for example when learning a walk for cars we onlymodify pixels in segmented car region when generatingedit(G(z) α) StyleGAN is able to roughly localize thecolor transformation to this region suggesting disentangle-ment of different objects within the W latent space (Fig 25left) as also noted in [13 21] We also show qualitative re-sults for adjust image contrast (Fig 25 right) and for com-bining zoom shift X and shift Y transformations (Fig 26)

00 02 04 06 08 10Area

05

10

15

20

PD

F

105

data

model

00 02 04 06 08 10Center Y

0

1

2

PD

F

102

data

model

00 02 04 06 08 10Center X

00

05

10

15

20

PD

F

102

data

model

00 02 04 06 08 10Pixel Intensity

2

3

4

5

PD

F

103

data

model

ZoomShift YShift XLuminance

Figure 9 Comparing model versus dataset distribution We plot statistics of the generated under the color (luminance) zoom(object bounding box size) and shift operations (bounding box center) and compare them to the statistics of images in thetraining dataset

Luminance

Rotate 2D

Shift X Shift Y

Zoom

10 05 00 05 10crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

50 0 50crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

2 1 0 1 2log(crarr)

00

02

04

06

08

Per

ceptu

alD

ista

nce

Rotate 3D

500 250 0 250 500crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

Figure 10 LPIPS Perceptual distances between images generated from pairs of consecutive αi and αi+1 We sample 1000images from randomly selected categories using BigGAN transform them according to the learned linear trajectory for eachtransformation We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area) aswell as 20 individual samples (scatterplot) Because the Rotate 3D operation undershoots the targeted transformation weobserve more visible effects when we increase the α magnitude

Figure 11 Bounding boxes for random selected classes using ImageNet training images

Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

10 05 00 05 10crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100In

ters

ection

2 0 2log(crarr)

000

025

050

075

100

Inte

rsec

tion

020 025 030Data

01

02

03

04

05

Model

micro

010 015 020Data

00

02

04

Model

micro

005 010 015Data

01

02

03

04

Model

micro

02 03Data

02

04

06

08

Model

micro

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

10 05 00 05 10crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

2 0 2log(crarr)

20

30

40

50

FID

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25P

DF

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

p=049 p=049 p=055 p=060

Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte


Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
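The released implementation is in TensorFlow; purely as a reading aid, the following is a minimal PyTorch-style sketch of this optimization under the L2 objective, assuming a pretrained generator G(z) (unconditional, for simplicity) and a pixel-space edit function edit_fn supplied by the user. Names and hyperparameters here are illustrative, not the paper's exact settings.

import torch

def train_linear_walk(G, edit_fn, dim_z, steps=625, batch_size=32,
                      alpha_max=1.0, lr=1e-4, device="cuda"):
    # Learn a single walk vector w by minimizing
    # J(w, alpha) = E_z[ || G(z + alpha*w) - edit(G(z), alpha) ||^2 ].
    w = torch.zeros(1, dim_z, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):  # 625 steps of batch 32 ~ 20,000 samples
        z = torch.randn(batch_size, dim_z, device=device)
        alpha = (2 * torch.rand(batch_size, 1, device=device) - 1) * alpha_max
        with torch.no_grad():
            target = edit_fn(G(z), alpha)        # target edit in pixel space
        output = G(z + alpha * w)                # steered generation
        loss = ((output - target) ** 2).mean()   # L2 loss; see B.4.1 for LPIPS
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()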

A.2. Implementation Details for the Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.

Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
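For concreteness, one possible pixel-space realization of this target edit and its visibility mask is sketched below (illustrative PyTorch on NCHW tensors, not the released code; shift_px is the signed integer pixel shift). The masked L2 loss is then ((G(z + αg·w) − shifted)**2 * mask).mean(), so the generator is free to hallucinate the disoccluded region.

import torch

def shift_x_edit(imgs, shift_px):
    # Shift an NCHW batch horizontally by `shift_px` pixels and return a
    # mask that is 1 on pixels carried over from the source image, so the
    # loss is only enforced on the visible region.
    shifted = torch.roll(imgs, shifts=shift_px, dims=3)
    mask = torch.ones_like(imgs)
    if shift_px > 0:
        shifted[..., :shift_px] = 0
        mask[..., :shift_px] = 0
    elif shift_px < 0:
        shifted[..., shift_px:] = 0
        mask[..., shift_px:] = 0
    return shifted, mask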

Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
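A corresponding sketch of the zoom target edit (again illustrative PyTorch rather than the released TensorFlow code): center-crop and upsample for αt < 1, downsample and pad onto a blank canvas for αt > 1, with the padded region masked out during training.

import torch
import torch.nn.functional as F

def zoom_edit(imgs, alpha_t):
    # imgs: NCHW batch. alpha_t < 1 zooms in, alpha_t > 1 zooms out.
    n, c, h, w = imgs.shape
    if alpha_t < 1:  # zoom in: crop the center, then resize back up
        ch, cw = max(1, int(h * alpha_t)), max(1, int(w * alpha_t))
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = imgs[:, :, top:top + ch, left:left + cw]
        return F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)
    # zoom out: downsample and paste into a blank (masked) canvas
    sh, sw = max(1, int(h / alpha_t)), max(1, int(w / alpha_t))
    small = F.interpolate(imgs, size=(sh, sw), mode="bilinear", align_corners=False)
    canvas = torch.zeros_like(imgs)
    top, left = (h - sh) // 2, (w - sw) // 2
    canvas[:, :, top:top + sh, left:left + sw] = small
    return canvas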

Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
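A minimal sketch of this parameterization (illustrative PyTorch; w_rgb stacks the three learned directions and alpha_rgb is the per-channel slider):

import torch

def color_walk(z, w_rgb, alpha_rgb):
    # z: (N, dim_z); w_rgb: (3, dim_z) learned directions; alpha_rgb: (3,) slider.
    return z + (alpha_rgb.unsqueeze(1) * w_rgb).sum(dim=0)

def color_edit(imgs, alpha_rgb):
    # Target edit: add each alpha to the corresponding RGB channel.
    return imgs + alpha_rgb.view(1, 3, 1, 1)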

Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotations. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
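One way to realize this rotation target and its visibility mask (a sketch assuming torchvision's tensor transforms; rotating an all-ones image marks the pixels that remain in frame):

import torch
import torchvision.transforms.functional as TF

def rotate2d_edit(imgs, angle_deg):
    # Rotate an NCHW batch by `angle_deg` degrees; out-of-frame pixels become zero.
    rotated = TF.rotate(imgs, angle=float(angle_deg))
    mask = TF.rotate(torch.ones_like(imgs), angle=float(angle_deg))
    return rotated, mask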

Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotations, scale αg = αt/R where R = 45, and apply a mask during training.

A.3. Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.

Proof. For simplicity, let w = F(z). We optimize J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that, for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target starting from an initial z, we can take two steps of size α in latent space as follows:

z1 = z + αF(z)
z2 = z1 + αF(z1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z1 + αF(z1). (4)

Therefore,

z + 2αF(z) = z + αF(z) + αF(z1) ⇒ αF(z) = αF(z1) ⇒ F(z) = F(z1).

Thus F(·) simply becomes a linear trajectory that is independent of the input z.
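For reference, the same chain of equalities can be written compactly in display form (a restatement of Eq. (4) and the lines above, with no new assumptions):

\begin{align*}
z + 2\alpha F(z) = z_2 &= z_1 + \alpha F(z_1) = z + \alpha F(z) + \alpha F(z_1) \\
\Rightarrow\ \alpha F(z) &= \alpha F(z_1) \quad\Rightarrow\quad F(z) = F(z_1).
\end{align*}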

A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
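A sketch of this parameterization (illustrative PyTorch; the hidden width and the renormalization to the expected norm sqrt(dim_z) of a standard Gaussian latent are assumptions, not the paper's exact choices):

import torch
import torch.nn as nn

class NonlinearWalk(nn.Module):
    # F(z) = z + NN(z); larger transformations apply F repeatedly (Euler-style steps).
    def __init__(self, dim_z, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_z, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim_z))

    def forward(self, z, n_steps=1):
        for _ in range(n_steps):
            z = z + self.net(z)
            # renormalize the latent magnitude (assumed target: sqrt(dim_z))
            z = z * (z.shape[1] ** 0.5) / z.norm(dim=1, keepdim=True)
        return z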

B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.

B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we compute the LPIPS perceptual distance between images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
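This probe can be reproduced with the reference LPIPS package of [25]; a minimal sketch, assuming a generator G, a learned walk w, and images already scaled to [−1, 1] as the metric expects:

import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric of [25]

def consecutive_lpips(G, w, z, alphas):
    # Mean LPIPS distance between images generated at consecutive alpha values.
    dists = []
    with torch.no_grad():
        for a0, a1 in zip(alphas[:-1], alphas[1:]):
            d = lpips_fn(G(z + a0 * w), G(z + a1 * w))
            dists.append(d.mean().item())
    return dists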

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).

B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model². In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset and images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.

B.4. Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], (5)

using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and a learning rate of 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).

B.4.2 Non-linear Walks

Following the non-linear walk formulation in A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.

²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API

B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly selected categories in BigGAN for the linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.

B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.

B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right) and for combining the zoom, shift X, and shift Y transformations (Fig. 26).
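The segmented objective amounts to restricting the reconstruction loss to the masked region; a minimal sketch, assuming a binary mask (1 inside the segmented car region, 0 elsewhere) obtained from an off-the-shelf segmenter:

import torch

def segmented_l2(output, target, mask):
    # Only penalize differences inside the segmented region.
    diff = (output - target) * mask
    return (diff ** 2).sum() / mask.sum().clamp(min=1.0)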

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude. Panels: Luminance, Rotate 2D, Shift X, Shift Y, Zoom, and Rotate 3D (perceptual distance versus α, or log(α) for zoom).

Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes using model-generated images for the zoom and horizontal and vertical shift transformations under random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation. Columns: Luminance, Shift X, Shift Y, and Zoom (p = 0.49, 0.49, 0.55, and 0.60, respectively).

Figure 16: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective. Panels show the − to + direction for Color, Zoom, Shift X, Shift Y, Rotate 2D, and Rotate 3D.

Figure 17: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective. Panels show the − to + direction for Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, and Color.

Figure 18: Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory. Panels show the − to + direction for Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, and Color.

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator. Panels show the − to + direction for Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, and Color.

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator. Columns: Luminance, Shift X, Shift Y, and Zoom; panels show the data and model attribute distributions, the model distributions under varying α, the intersection with the dataset distribution, and FID as a function of α (log(α) for zoom).

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator. Panels show the − to + direction for Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, and Color.

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator. Columns: Luminance, Shift X, Shift Y, and Zoom; panels show the data and model attribute distributions, the model distributions under varying α, the intersection with the dataset distribution, and FID as a function of α (log(α) for zoom).

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator. Panels show the − to + direction for Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, and Color.

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. Columns: Luminance, Shift X, Shift Y, and Zoom. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.

Page 11: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

A4 Optimization for the non-linear walk

Given the limitations of the previous walk we define ournonlinear walk F (z) using discrete step sizes ε We defineF (z) as z + NN(z) where the neural network NN learnsa fixed ε step transformation rather than a variable α stepWe then renormalize the magnitude z This approach mim-ics the Euler method for solving ODEs with a discrete stepsize where we assume that the gradient of the transforma-tion in latent space is of the form εdzdt = NN(z) and we ap-proximate zi+1 = zi + εdzdt |zi The key difference from A3is the fixed step size which avoids optimizing for the equal-ity in (4)

We use a two-layer neural network to parametrize thewalk and optimize over 20000 samples using the Adamoptimizer as before Positive and negative transformationdirections are handled with two neural networks havingidentical architecture but independent weights We set ε toachieve the same transformation ranges as the linear trajec-tory within 4-5 steps

B Additional ExperimentsB1 Model and Data Distributions

How well does the model distribution of each propertymatch the dataset distribution If the generated images donot form a good approximation of the dataset variability weexpect that this would also impact our ability to transformgenerated images In Fig 9 we show the attribute distri-butions of the BigGAN model G(z) compared to samplesfrom the ImageNet dataset We show corresponding resultsfor StyleGAN and its respective datasets in Sec B5 Whilethere is some bias in how well model-generated images ap-proximated the dataset distribution we hypothesize that ad-ditional biases in our transformations come from variabilityin the training data

B2 Quantifying Transformation Limits

We observe that when we increase the transformationmagnitude α in latent space the generated images becomeunrealistic and the transformation ceases to have further ef-fect We show this qualitatively in Fig 3 To quantitativelyverify this trends we can compute the LPIPS perceptualdistance of images generated using consecutive pairs of αiand αi+1 For shift and zoom transformations perceptualdistance is larger when α (or log(α) for zoom) is near zeroand decreases as the the magnitude of α increases whichindicates that large α magnitudes have a smaller transfor-mation effect and the transformed images appear more sim-ilar On the other hand color and rotate in 2D3D exhibit asteady transformation rate as the magnitude of α increases

Note that this analysis does not tell us how well weachieve the specific transformation nor whether the latent

trajectory deviates from natural-looking images Rather ittells us how much we manage to change the image regard-less of the transformation target To quantify how well eachtransformation is achieved we rely on attribute detectorssuch as object bounding boxes (see B3)

B3 Detected Bounding Boxes

To quantify the degree to which we are able to achievethe zoom and shift transformations we rely on a pre-trainedMobileNet-SSD v12 object detection model In Fig 11and 12 we show the results of applying the object detectionmodel to images from the dataset and images generatedby the model under the zoom horizontal shift and verti-cal shift transformations for randomly selected values of αto qualitatively verify that the object detection boundariesare reasonable Not all ImageNet images contain recogniz-able objects so we only use ImageNet classes containingobjects recognizable by the detector for this analysis

B4 Alternative Walks in BigGAN

B41 LPIPS objective

In the main text we learn the latent space walk w by mini-mizing the objective function

J(wα) = Ez [L(G(z + αw)edit(G(z) α))] (5)

using a Euclidean loss for L In Fig 13 we show qualitativeresults using the LPIPS perceptual similarity metric [25] in-stead of Euclidean loss Walks were trained using the sameparameters as those in the linear-L2 walk shown in the maintext we use 20k samples for training with Adam optimizerand learning rate 0001 for zoom and color 00001 for theremaining edit operations (due to scaling of α)

B42 Non-linear Walks

Following B42 we modify our objective to use discretestep sizes ε rather than continuous steps We learn a func-tion F (z) to perform this ε-step transformation on given la-tent code z where F (z) is parametrized with a neural net-work We show qualitative results in Fig 14 We performthe same set of experiments shown in the main text usingthis nonlinear walk in Fig 15 These experiments exhibitsimilar trends as we observed in the main text ndash we are ableto modify the generated distribution of images using latentspace walks and the amount to which we can transformis related to the variability in the dataset However thereare greater increases in FID when we apply the non-lineartransformation suggesting that these generated images de-viate more from natural images and look less realistic

2httpsgithubcomopencvopencvwikiTensorFlow-Object-Detection-API

B43 Additional Qualitative Examples

We show qualitative examples for randomly generated cat-egories for BigGAN linear-L2 linear LPIPS and nonlineartrajectories in Figs 16 17 18 respectively

B5 Walks in StyleGAN

We perform similar experiments for linear latent spacewalks using StyleGAN models trained on the LSUN catLSUN car and FFHQ face datasets As suggested by [13]we learn the walk vector in the intermediate W latent spacedue to improved attribute disentanglement in W We showqualitative results for color shift and zoom transformationsin Figs 19 21 23 and corresponding quantitative analyses

in Figs 20 22 24

B6 Qualitative examples for additional transfor-mations

Since the color transformation operates on individualpixels we can optimize the walk using a segmented tar-get ndash for example when learning a walk for cars we onlymodify pixels in segmented car region when generatingedit(G(z) α) StyleGAN is able to roughly localize thecolor transformation to this region suggesting disentangle-ment of different objects within the W latent space (Fig 25left) as also noted in [13 21] We also show qualitative re-sults for adjust image contrast (Fig 25 right) and for com-bining zoom shift X and shift Y transformations (Fig 26)

00 02 04 06 08 10Area

05

10

15

20

PD

F

105

data

model

00 02 04 06 08 10Center Y

0

1

2

PD

F

102

data

model

00 02 04 06 08 10Center X

00

05

10

15

20

PD

F

102

data

model

00 02 04 06 08 10Pixel Intensity

2

3

4

5

PD

F

103

data

model

ZoomShift YShift XLuminance

Figure 9 Comparing model versus dataset distribution We plot statistics of the generated under the color (luminance) zoom(object bounding box size) and shift operations (bounding box center) and compare them to the statistics of images in thetraining dataset

Luminance

Rotate 2D

Shift X Shift Y

Zoom

10 05 00 05 10crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

50 0 50crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

2 1 0 1 2log(crarr)

00

02

04

06

08

Per

ceptu

alD

ista

nce

Rotate 3D

500 250 0 250 500crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

Figure 10 LPIPS Perceptual distances between images generated from pairs of consecutive αi and αi+1 We sample 1000images from randomly selected categories using BigGAN transform them according to the learned linear trajectory for eachtransformation We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area) aswell as 20 individual samples (scatterplot) Because the Rotate 3D operation undershoots the targeted transformation weobserve more visible effects when we increase the α magnitude

Figure 11 Bounding boxes for random selected classes using ImageNet training images

Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

10 05 00 05 10crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100In

ters

ection

2 0 2log(crarr)

000

025

050

075

100

Inte

rsec

tion

020 025 030Data

01

02

03

04

05

Model

micro

010 015 020Data

00

02

04

Model

micro

005 010 015Data

01

02

03

04

Model

micro

02 03Data

02

04

06

08

Model

micro

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

10 05 00 05 10crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

2 0 2log(crarr)

20

30

40

50

FID

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25P

DF

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

p=049 p=049 p=055 p=060

Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 12: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

B43 Additional Qualitative Examples

We show qualitative examples for randomly generated cat-egories for BigGAN linear-L2 linear LPIPS and nonlineartrajectories in Figs 16 17 18 respectively

B5 Walks in StyleGAN

We perform similar experiments for linear latent spacewalks using StyleGAN models trained on the LSUN catLSUN car and FFHQ face datasets As suggested by [13]we learn the walk vector in the intermediate W latent spacedue to improved attribute disentanglement in W We showqualitative results for color shift and zoom transformationsin Figs 19 21 23 and corresponding quantitative analyses

in Figs 20 22 24

B6 Qualitative examples for additional transfor-mations

Since the color transformation operates on individualpixels we can optimize the walk using a segmented tar-get ndash for example when learning a walk for cars we onlymodify pixels in segmented car region when generatingedit(G(z) α) StyleGAN is able to roughly localize thecolor transformation to this region suggesting disentangle-ment of different objects within the W latent space (Fig 25left) as also noted in [13 21] We also show qualitative re-sults for adjust image contrast (Fig 25 right) and for com-bining zoom shift X and shift Y transformations (Fig 26)

00 02 04 06 08 10Area

05

10

15

20

PD

F

105

data

model

00 02 04 06 08 10Center Y

0

1

2

PD

F

102

data

model

00 02 04 06 08 10Center X

00

05

10

15

20

PD

F

102

data

model

00 02 04 06 08 10Pixel Intensity

2

3

4

5

PD

F

103

data

model

ZoomShift YShift XLuminance

Figure 9 Comparing model versus dataset distribution We plot statistics of the generated under the color (luminance) zoom(object bounding box size) and shift operations (bounding box center) and compare them to the statistics of images in thetraining dataset

Luminance

Rotate 2D

Shift X Shift Y

Zoom

10 05 00 05 10crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

50 0 50crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

400 200 0 200 400crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

2 1 0 1 2log(crarr)

00

02

04

06

08

Per

ceptu

alD

ista

nce

Rotate 3D

500 250 0 250 500crarr

00

02

04

06

08

Per

ceptu

alD

ista

nce

Figure 10 LPIPS Perceptual distances between images generated from pairs of consecutive αi and αi+1 We sample 1000images from randomly selected categories using BigGAN transform them according to the learned linear trajectory for eachtransformation We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area) aswell as 20 individual samples (scatterplot) Because the Rotate 3D operation undershoots the targeted transformation weobserve more visible effects when we increase the α magnitude

Figure 11 Bounding boxes for random selected classes using ImageNet training images

Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

10 05 00 05 10crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100In

ters

ection

2 0 2log(crarr)

000

025

050

075

100

Inte

rsec

tion

020 025 030Data

01

02

03

04

05

Model

micro

010 015 020Data

00

02

04

Model

micro

005 010 015Data

01

02

03

04

Model

micro

02 03Data

02

04

06

08

Model

micro

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

10 05 00 05 10crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

2 0 2log(crarr)

20

30

40

50

FID

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25P

DF

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

p=049 p=049 p=055 p=060

Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 13: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

Figure 11 Bounding boxes for random selected classes using ImageNet training images

Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

[Figure 15 plots: for Luminance, Shift X, Shift Y, and Zoom, the per-attribute PDFs under model(α), the intersection with G(z) as a function of α, the FID as a function of α, and model-vs-data scatterplots with p = 0.49, 0.49, 0.55, 0.60 respectively.]

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
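For readers reproducing the intersection numbers in Figures 15, 20, 22, and 24, one plausible implementation is the overlap area of normalized attribute histograms under the transformed model and under G(z). The sketch below assumes the attribute values (e.g., luminance or bounding-box statistics) have already been extracted; the bin count and value range are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def histogram_intersection(attr_model, attr_gz, bins=50, value_range=(0.0, 1.0)):
    """Overlap area between two empirical attribute distributions.

    attr_model: attribute values of images generated under the learned walk
                at a fixed alpha, i.e. under model(alpha).
    attr_gz:    attribute values of images from the raw generator G(z).
    Returns a value in [0, 1]; 1 means the two distributions fully overlap.
    """
    p, edges = np.histogram(attr_model, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(attr_gz, bins=bins, range=value_range, density=True)
    bin_width = edges[1] - edges[0]
    # Integrate the pointwise minimum of the two density estimates.
    return float(np.sum(np.minimum(p, q)) * bin_width)
```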

- Color +   - Zoom +   - Shift X +   - Shift Y +   - Rotate 2D +   - Rotate 3D +

Figure 16: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective.

- Rotate 2D +   - Shift X +   - Zoom +   - Rotate 3D +   - Shift Y +   - Color +

Figure 17: Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective.

- Rotate 2D +   - Shift X +   - Zoom +   - Rotate 3D +   - Shift Y +   - Color +

Figure 18: Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory.

- Rotate 2D +   - Shift X +   - Zoom +   - Rotate 3D +   - Shift Y +   - Color +

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.

[Figure 20 plots: for Luminance, Shift X, Shift Y, and Zoom, the per-attribute PDFs under data and model(α), the intersection between data and model(α) as a function of α, and the FID as a function of α.]

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.

- Rotate 2D +   - Shift X +   - Zoom +   - Rotate 3D +   - Shift Y +   - Color +

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.

[Figure 22 plots: for Luminance, Shift X, Shift Y, and Zoom, the per-attribute PDFs under data and model(α), the intersection between data and model(α) as a function of α, and the FID as a function of α.]

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.

- Rotate 2D +   - Shift X +   - Zoom +   - Rotate 3D +   - Shift Y +   - Color +

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.

[Figure 24 plots: for Luminance, Shift X, Shift Y, and Zoom (face bounding-box statistics), the per-attribute PDFs under data and model(α), the intersection between data and model(α) as a function of α, and the FID as a function of α.]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
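The bounding-box statistics behind Figure 24 can be extracted with dlib's frontal face detector, as the caption notes. The helper below is a minimal sketch, assuming generated faces arrive as RGB numpy arrays; the function name and the normalization by image size are illustrative choices rather than the paper's exact pipeline.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()  # HOG-based frontal face detector

def face_box_stats(image_rgb):
    """Return (center_x, center_y, area) of the largest detected face,
    normalized by image size, or None when no face is found (such alpha
    values are plotted as zeros in the zoom distributions)."""
    h, w = image_rgb.shape[:2]
    detections = detector(image_rgb, 1)  # upsample once to catch small faces
    if len(detections) == 0:
        return None
    box = max(detections, key=lambda r: r.width() * r.height())
    center_x = 0.5 * (box.left() + box.right()) / w
    center_y = 0.5 * (box.top() + box.bottom()) / h
    area = box.width() * box.height() / float(w * h)
    return center_x, center_y, area
```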

- Contrast +   - Color +

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in the left column, and a contrast walk for both BigGAN and StyleGAN in the right column.
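Figure 25's "segmented target" can be read as restricting the color-walk objective to foreground pixels, so that only the object, not the background, is pushed toward the recolored target. The loss below is a sketch under that reading; the masking scheme and weighting are assumptions, not the paper's exact objective.

```python
import torch

def masked_color_walk_loss(generated, target, mask):
    """Masked L2 objective for a color walk with a segmented target.

    generated: G(z + alpha * w), shape (N, 3, H, W)
    target:    color-shifted version of G(z), same shape
    mask:      foreground segmentation in {0, 1}, shape (N, 1, H, W)
    """
    diff = (generated - target) * mask          # ignore background pixels
    return diff.pow(2).sum() / mask.sum().clamp(min=1.0)
```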

Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
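The third row of Figure 26 ("combining the separately trained walks") amounts to summing the individually learned, scaled walk directions in latent space before decoding. The snippet below is a small sketch of that composition; the dictionary interface and variable names are illustrative.

```python
import numpy as np

def apply_combined_walk(z, walks, alphas):
    """Compose separately trained linear walks by summing scaled directions.

    z:      latent code, shape (dim_z,)
    walks:  unit walk directions, e.g. {'zoom': w_zoom, 'shiftx': w_x, 'shifty': w_y}
    alphas: step size per transformation, e.g. {'zoom': 1.4, 'shiftx': 100.0}
    """
    z_edit = z.copy()
    for name, direction in walks.items():
        z_edit = z_edit + alphas.get(name, 0.0) * direction
    return z_edit  # the transformed image is G(z_edit)
```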

Page 14: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)

Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk

10 05 00 05 10crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100In

ters

ection

2 0 2log(crarr)

000

025

050

075

100

Inte

rsec

tion

020 025 030Data

01

02

03

04

05

Model

micro

010 015 020Data

00

02

04

Model

micro

005 010 015Data

01

02

03

04

Model

micro

02 03Data

02

04

06

08

Model

micro

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

10 05 00 05 10crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

2 0 2log(crarr)

20

30

40

50

FID

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25P

DF

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

p=049 p=049 p=055 p=060

Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 15: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

10 05 00 05 10crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100

Inte

rsec

tion

200 100 0 100 200crarr

000

025

050

075

100In

ters

ection

2 0 2log(crarr)

000

025

050

075

100

Inte

rsec

tion

020 025 030Data

01

02

03

04

05

Model

micro

010 015 020Data

00

02

04

Model

micro

005 010 015Data

01

02

03

04

Model

micro

02 03Data

02

04

06

08

Model

micro

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

10 05 00 05 10crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

200 100 0 100 200crarr

20

30

40

50

FID

2 0 2log(crarr)

20

30

40

50

FID

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

00 02 04 06 08 10Pixel Intensity

00

05

10

15

20

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

00 02 04 06 08 10Center X

00

05

10

15

20

25P

DF

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Center Y

0

1

2

3

4

PD

F

102

model

crarr=-200

crarr=-150

crarr=-100

crarr=-50

crarr=50

crarr=100

crarr=150

crarr=200

00 02 04 06 08 10Area

00

05

10

15

PD

F

104

model

crarr=00625

crarr=0125

crarr=025

crarr=05

crarr=20

crarr=40

crarr=80

crarr=160

Luminance Shift X Shift Y Zoom

p=049 p=049 p=055 p=060

Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 16: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

- Color +- Zoom +

- Shift X + - Shift Y +

- Rotate 2D + - Rotate 3D +

Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 17: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 18: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

20

30

40

FID

00 05 10ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

50

100

FID

00 05 10Zoom

05

10

15

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100ShiftY

0

1

2

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

0

2

4

6

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Luminance Shift X Shift Y Zoom

Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator

04 06 08ShiftX

0

2

4

6P

DF

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Shift X Shift Y Zoom

1 0 1crarr

50

100

150

FID

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

50

100

FID

04 06 08ShiftX

0

2

4

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

100 0 100crarr

50

100

FID

04 06 08ShiftY

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10In

ters

ecti

on

1 0 1log(crarr)

100

200

300

FID

00 05 10Zoom

00

05

10

PD

F

105

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

1 0 1crarr

00

05

10

Inte

rsec

tion

00 05 10Luminance

2

4

PD

F

103

data

model

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance

04 06 08ShiftX

0

2

4

6

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

04 06 08ShiftY

0

2

4

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

20

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates

- Contrast +- Color +

Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center

Page 19: arXiv:1907.07171v1 [cs.CV] 16 Jul 2019 · 2019. 7. 17. · target images. In addition to linear walks, we explore us-ing non-linear walks to achieve camera motion and color transformations

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator

00 05 10Luminance

05

10

PD

F

102

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

40

FID

00 05 10ShiftX

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

100 0 100crarr

25

50

75

FID

00 05 10ShiftY

00

05

10

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

1 0 1log(crarr)

0

200F

ID

00 05 10Zoom

0

2

4

6

PD

F

106

data

model

1 0 1log(crarr)

00

05

10

Inte

rsec

tion

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

1 0 1crarr

25

50

75

FID

000 025 050 075 100ShiftY

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Zoom

00

05

10

15

PD

F

105

model

crarr=025

crarr=035

crarr=050

crarr=071

crarr=141

crarr=200

crarr=283

crarr=400

000 025 050 075 100ShiftX

000

025

050

075

100

PD

F

102

model

crarr=-1000

crarr=-750

crarr=-500

crarr=-250

crarr=250

crarr=500

crarr=750

crarr=1000

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

Luminance Shift X Shift Y Zoom

Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator

- Rotate 2D +

- Shift X +

- Zoom +

- Rotate 3D +

- Shift Y +

- Color +

Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator

1 0 1crarr

20

40

FID

00 05 10Luminance

2

3

4

5

PD

F

103

data

model

1 0 1crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100Luminance

00

05

10

15

PD

F

102

model

crarr=-10

crarr=-075

crarr=-05

crarr=-025

crarr=025

crarr=05

crarr=075

crarr=10

100 0 100crarr

20

25

FID

00 05 10ShiftX

0

1

2

PD

F

102

data

model

100 0 100crarr

00

05

10

Inte

rsec

tion

000 025 050 075 100ShiftX

00

05

10

15

20

PD

F

102

model


[Figure 20 plots: panels for Luminance, Shift X, Shift Y, and Zoom. Each column shows FID of the steered samples as a function of α, the attribute PDF under the dataset and under the model, the intersection between the two distributions as a function of α, and the model attribute PDFs at individual α steps.]

Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.
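These quantitative panels compare the attribute distribution of the training data with that of the steered model outputs and summarize their overlap. As a rough illustration only, the sketch below estimates such an overlap with shared-bin histograms; it is an assumption about the metric's form rather than the released evaluation code, and the array names (data_attr, model_attr) are placeholders.

import numpy as np

def distribution_intersection(data_attr, model_attr, bins=50, value_range=(0.0, 1.0)):
    # Estimate each attribute PDF with a shared-bin histogram (density=True
    # normalizes the bars to integrate to 1), then take the area shared by
    # the two estimates; 1.0 means the distributions coincide.
    p, _ = np.histogram(data_attr, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(model_attr, bins=bins, range=value_range, density=True)
    bin_width = (value_range[1] - value_range[0]) / bins
    return np.sum(np.minimum(p, q)) * bin_width

# e.g., overlap of per-image mean luminance in [0, 1] at one steering strength:
# overlap = distribution_intersection(data_luminance, model_luminance_at_alpha)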

[Figure 21 panels: Rotate 2D, Shift X, Zoom, Rotate 3D, Shift Y, and Color, each shown from −α to +α.]

Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.

[Figure 22 plots: panels for Luminance, Shift X, Shift Y, and Zoom, showing FID vs. α, the data and model attribute PDFs, the intersection between them vs. α, and the model attribute PDFs at individual α steps.]

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.

[Figure 23 panels: Rotate 2D, Shift X, Zoom, Rotate 3D, Shift Y, and Color, each shown from −α to +α.]

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.

[Figure 24 plots: panels for Luminance, Shift X, Shift Y, and Zoom, showing FID vs. α, the data and model attribute PDFs, the intersection between them vs. α, and the model attribute PDFs at individual α steps.]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
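Since the caption measures the zoom attribute through face bounding boxes, a minimal sketch of that measurement is given below. The use of dlib's frontal face detector follows the caption, but the specific attribute definition (height of the largest detected box relative to image height) and the function name are assumptions for illustration.

import dlib

detector = dlib.get_frontal_face_detector()

def face_zoom_attribute(image_uint8):
    # image_uint8: HxWx3 RGB numpy array. Returns relative face size in (0, 1],
    # or None when no face is found (such samples are plotted as zeros).
    detections = detector(image_uint8, 1)  # upsample once to catch smaller faces
    if len(detections) == 0:
        return None
    box = max(detections, key=lambda r: r.width() * r.height())  # largest face
    return (box.bottom() - box.top()) / float(image_uint8.shape[0])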

[Figure 25 panels: Color and Contrast walks, each shown from −α to +α.]

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in the left column, and a contrast walk for both BigGAN and StyleGAN in the right column.

Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
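For the third row of Figure 26, steering with separately trained walks reduces to adding each scaled direction to the latent code before generation. A minimal sketch of that composition is shown below; the identifiers (G, w_zoom, w_shiftx, w_shifty) are placeholders rather than names from the released code.

import numpy as np

def compose_walks(z, steps, directions):
    # z: latent code; steps: list of scalar alphas; directions: matching list
    # of learned walk vectors w. Returns the edited latent code.
    z_edit = np.array(z, dtype=np.float32, copy=True)
    for alpha, w in zip(steps, directions):
        z_edit = z_edit + alpha * np.asarray(w, dtype=np.float32)
    return z_edit

# e.g., x_edit = G(compose_walks(z, [a_zoom, a_x, a_y], [w_zoom, w_shiftx, w_shifty]))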
