On the "steerability" of generative adversarial networks

Ali Jahanian, Lucy Chai, Phillip Isola
MIT CSAIL
{jahanian, lrchai, phillipi}@mit.edu
[Figure 1 panels: - Zoom +, - Shift Y +, - Shift X +, - Brightness +, - Rotate2D +, - Rotate3D +]

Figure 1: Learned latent space trajectories in generative adversarial networks correspond to visual transformations like camera shift and zoom. These transformations can change the distribution of generated data, but only so much: biases in the data, like centered objects, reveal themselves as objects get "stuck" at the image borders when we try to shift them out of frame. Take the "steering wheel", drive in the latent space, and explore the natural image manifold via generative transformations.
Abstract

An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to training on biased data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise: these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by "steering" in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution, and conduct experiments that demonstrate this. Code is released on our project page: https://ali-design.github.io/gan_steerability
1 Introduction
The quality of deep generative models has increased dramatically over the past few years. When introduced in 2014, Generative Adversarial Networks (GANs) could only synthesize MNIST digits and low-resolution grayscale faces [10]. The most recent models, on the other hand, produce diverse and high-resolution images that are often indistinguishable from natural photos [4, 13].
Science fiction has long dreamed of virtual realities filled with synthetic content as rich as, or richer than, the real world (e.g., The Matrix, Ready Player One). How close are we to this dream? Traditional computer graphics can render photorealistic 3D scenes, but has no way to automatically generate detailed content. Generative models like GANs, in contrast, can hallucinate content from scratch, but we do not currently have tools for navigating the generated scenes in the same kind of way as you can walk through and interact with a 3D game engine.
In this paper, we explore the degree to which you can navigate the visual world of a GAN. Figure 1 illustrates the kinds of transformations we explore. Consider the dog at the top-left. By moving in some direction of GAN latent space, can we hallucinate walking toward this dog? As the figure indicates, and as we will show in this paper, the answer is yes. However, as we continue to zoom in,
arXiv:1907.07171v1 [cs.CV] 16 Jul 2019
we quickly reach limits. The direction in latent space that zooms toward the dog can only make the dog so big. Once the dog face fills the full frame, continuing to walk in this direction fails to increase the zoom. A similar effect occurs in the daisy example (row 2 of Fig. 1): we can find a direction in latent space that shifts the daisy up or down, but the daisy gets "stuck" at the edge of the frame; continuing to walk in this direction does not shift the daisy out of the frame.
We hypothesize that these limits are due to biases in the distribution of images on which the GAN is trained. For example, if dogs and daisies tend to be centered and in frame in the dataset, the same may be the case in the GAN-generated images as well. Nonetheless, we find that some degree of transformation is possible: we can still shift the generated daisies up and down in the frame. When and why can we achieve certain transformations but not others?
This paper seeks to quantify the degree to which we can achieve basic visual transformations by navigating in GAN latent space. In other words, are GANs "steerable" in latent space?¹ We further analyze the relationship between the data distribution on which the model is trained and the success in achieving these transformations. We find that it is possible to shift the distribution of generated images to some degree, but cannot extrapolate entirely out of the support of the training data. In particular, we find that attributes can be shifted in proportion to the variability of that attribute in the training data.
One of the current criticisms of generative models is that they simply interpolate between datapoints and fail to generate anything truly new. Our results add nuance to this story. It is possible to achieve distributional shift, where the model generates realistic images from a different distribution than that on which it was trained. However, the ability to steer the model in this way depends on the data distribution having sufficient diversity along the dimension we are steering.
Our main contributions are:
• We show that a simple walk in the z space of GANs can achieve camera motion and color transformations in the output space, in a self-supervised manner, without labeled attributes or distinct source and target images.
• We show that this linear walk is as effective as more complex non-linear walks, suggesting that GAN models learn to linearize these operations without being explicitly trained to do so.
• We design measures to quantify the extent and limits of the transformations generative models can achieve, and show experimental evidence. We further show the relation between the shiftability of the model distribution and the variability in datasets.

¹We use the term "steerable" in analogy to the classic steerable filters of Freeman & Adelson [7].
• We demonstrate these transformations as a general-purpose framework on different model architectures, e.g., BigGAN and StyleGAN, illustrating different disentanglement properties in their respective latent spaces.
2 Related work

Walking in the latent space can be seen from different perspectives: how to achieve it, what limits it, and what it enables us to do. Our work addresses these three aspects together, and we briefly refer to each one in prior work.
Interpolations in GAN latent space. Traditional approaches to image editing in latent space involve finding linear directions in GAN latent space that correspond to changes in labeled attributes, such as smile-vectors and gender-vectors for faces [20, 13]. However, smoothly varying latent space interpolations are not exclusive to GANs: in flow-based generative models, linear interpolations between encoded images allow one to edit a source image toward attributes contained in a separate target [16]. With Deep Feature Interpolation [24], one can remove the generative model entirely: instead, the interpolation operation is performed on an intermediate feature space of a pretrained classifier, again using the feature mappings of source and target sets to determine an edit direction. Unlike these approaches, we learn our latent-space trajectories in a self-supervised manner, without relying on labeled attributes or distinct source and target images. We measure the edit distance in pixel space rather than latent space, because our desired target image may not directly exist in the latent space of the generative model. Despite allowing for non-linear trajectories, we find that linear trajectories in the latent space model simple image manipulations (e.g., zoom-vectors and shift-vectors) surprisingly well, even when models were not explicitly trained to exhibit this property in latent space.
Dataset bias. Biases from training data and network architecture both factor into the generalization capacity of learned models [23, 8, 1]. Dataset biases partly come from human preferences in taking photos: we typically capture images in specific "canonical" views that are not fully representative of the entire visual world [19, 12]. When models are trained to fit these datasets, they inherit the biases in the data. Such biases may result in models that misrepresent the given task, such as tendencies towards texture bias rather than shape bias in ImageNet classifiers [8], which in turn limits their generalization performance on similar objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.
Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob: the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.
Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory, and therefore our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.
3 Method
Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here, z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, where transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).
Figure 2: We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(...f(z)...))) when we choose a non-linear walk.
Objective. How can we maneuver in the latent space to discover different transformations on a given image? We desire to learn a vector representing the optimal path for this latent space transformation, which we parametrize as an N-dimensional learned vector. We weight this vector by a continuous parameter α, which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function:
w* = arg min_w E_{z,α}[ L( G(z + αw), edit(G(z), α) ) ]    (1)
Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner: we perform our desired transformation on an arbitrary generated image, and subsequently we optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote this transformation in the direction of w*, controlled with the variable step size α, defined as model(α) = G(z + αw*).
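The optimization in Eq. (1) can be sketched with a toy linear stand-in for the generator. Everything here (the linear map A, the pixel-space edit direction t, the learning rate) is a hypothetical substitute for the pretrained GAN and image-space edit used in the paper; it only illustrates the structure of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained generator: a fixed linear map from
# latent codes (dim 8) to "images" (dim 16).
A = rng.standard_normal((16, 8))
G = lambda z: A @ z

# edit(x, alpha): the target transform in pixel space. Here, a shift by
# alpha along a fixed direction t, chosen inside the generator's range so
# the edit is achievable at all.
t = A @ rng.standard_normal(8)
edit = lambda x, alpha: x + alpha * t

# Learn the walk w by stochastic gradient descent on Eq. (1) with L2 loss.
w = np.zeros(8)
lr = 0.01
for _ in range(2000):
    z = rng.standard_normal(8)
    alpha = rng.uniform(-1.0, 1.0)
    residual = G(z + alpha * w) - edit(G(z), alpha)  # = alpha * (A @ w - t)
    w -= lr * 2 * alpha * (A.T @ residual)           # gradient of ||residual||^2

# Stepping by alpha * w in latent space should now reproduce the edit.
z = rng.standard_normal(8)
err = np.linalg.norm(G(z + 0.5 * w) - edit(G(z), 0.5))
print(err)  # near zero
```

In the real setting, G is the frozen BigGAN/StyleGAN generator and the gradient flows through it via backpropagation; only w is updated.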
The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories, in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize
L = E_{z,n}[ ‖ f^n(z) − edit(G(z), nε) ‖ ],    (2)
where f^n(·) is the n-th order function composition f(f(f(...))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.
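A minimal sketch of the n-fold composition used in Eq. (2). The one-step map f here is a hypothetical placeholder for the small neural network learned in practice; only the composition machinery follows the text.

```python
import numpy as np

def f(z, eps=0.1):
    # Hypothetical one-step map: in the paper, f is a learned neural network
    # whose single application models a small eps-step of the edit.
    return z + eps * np.tanh(z)

def compose(g, n):
    """Return g^n, the n-th order function composition g(g(...g(z)...))."""
    def gn(z):
        for _ in range(n):
            z = g(z)
        return z
    return gn

# Recursive application mimics a discrete Euler ODE scheme: n small eps-steps
# approximate one large n*eps transformation, which the loss in Eq. (2)
# compares against edit(G(z), n*eps).
z = np.ones(4)
z3 = compose(f, 3)(z)
print(z3.shape)  # (4,)
```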
Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.
We use this function composition approach, rather than the simpler setup of G(z + αNN(z)), because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).
Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., "luminance", in the dataset versus in images generated after walking in latent space.
For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
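A sketch of this pixel-sampling procedure as we read it from the text; the function and variable names are ours, not the authors' code, and we assume uint8 RGB images.

```python
import numpy as np

rng = np.random.default_rng(0)

def color_stats(img, n_pixels=100, rng=rng):
    """Estimate a color distribution by sampling pixels of an HxWx3 uint8
    image: returns per-channel samples and per-pixel luminance (mean over
    RGB), both normalized to [0, 1]."""
    h, w, _ = img.shape
    ys = rng.integers(0, h, size=n_pixels)
    xs = rng.integers(0, w, size=n_pixels)
    samples = img[ys, xs].astype(np.float64) / 255.0  # (n_pixels, 3) in [0, 1]
    luminance = samples.mean(axis=1)                  # mean RGB value per pixel
    return samples, luminance

# Hypothetical 64x64 RGB image standing in for a generated sample.
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
channels, lum = color_stats(img)
print(channels.shape, lum.shape)  # (100, 3) (100,)
```

Aggregating these samples over many images before and after the latent step gives the distributions compared in Fig. 5.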
For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest probability bounding box corresponding to the desired class and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box, and normalize by image width or height, respectively.
Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model, and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.
4 Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.
4.1 What image transformations can we achieve by steering in the latent space?
We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.
We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point. In fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because, in ImageNet photos, the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.
An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch: the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano to a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.
[Figure 4 panels: Redness (jellyfish / goldfinch), Darkness (volcano / alp), Blueness (sports car / fire engine), each with LPIPS perceptual-distance boxplots]
Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.
Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.
Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic looking images.
For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.
[Figure 5 panels: Luminance, Shift X, Shift Y, Zoom. For each transformation: attribute PDFs of the model output under varying α, the intersection between G(z) and model(α), and FID as a function of α.]
Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID of the transformed images, to limit our transformations to the natural image manifold.
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent: if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
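The intersection statistic can be sketched as the overlap area of two normalized histograms. The bin count and attribute range here are our own choices; the paper does not specify them.

```python
import numpy as np

def histogram_intersection(a, b, bins=50, value_range=(0.0, 1.0)):
    """Overlap area between two empirical distributions, via normalized
    histograms: 1.0 for identical distributions, 0.0 for disjoint ones."""
    ha, edges = np.histogram(a, bins=bins, range=value_range, density=True)
    hb, _ = np.histogram(b, bins=bins, range=value_range, density=True)
    width = edges[1] - edges[0]
    return float(np.minimum(ha, hb).sum() * width)

rng = np.random.default_rng(0)
base = rng.normal(0.5, 0.1, 5000)     # e.g. normalized center-X under G(z)
shifted = rng.normal(0.6, 0.1, 5000)  # center-X after a latent-space shift
print(histogram_intersection(base, shifted))  # well below 1: the mode moved
```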
We perform a similar procedure for the zoom operation, measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation, under the constraint of realistic looking images, differs from that of zooming out.
These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2 How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
[Figure 6 panels: Zoom (r = 0.37, p < 0.001), Shift Y (r = 0.39, p < 0.001), Shift X (r = 0.28, p < 0.001), Luminance (r = 0.59, p < 0.001): dataset σ vs. model ∆µ, with zoom examples for the robin and laptop classes]
Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.
More concretely, we define the change in model means under a given transformation as

∆µ_k = ‖ µ_{k, model(+α*)} − µ_{k, model(−α*)} ‖₂    (3)
for a given class k, under the optimal walk w* for each transformation in Eq. 1, and we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes: one that is large enough to separate the means of µ_{k, model(α*)} and µ_{k, model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
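To illustrate the measurement of Eq. (3) and the per-class correlation analysis, a self-contained sketch with synthetic per-class statistics: the data here is fabricated to have exactly the property the paper measures (larger dataset spread allows a larger mean shift), so only the ∆µ formula and the correlation computation follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_mu(stat_pos, stat_neg):
    """Eq. (3) for a scalar attribute: distance between class means of the
    attribute under the +alpha* and -alpha* transformations."""
    return abs(np.mean(stat_pos) - np.mean(stat_neg))

# Hypothetical dataset spread sigma for 100 classes, and attribute samples
# under the two extreme transformations, constructed so that classes with
# larger sigma separate further (the trend observed in Fig. 6).
sigmas = rng.uniform(0.02, 0.3, size=100)
dmus = np.array([
    delta_mu(rng.normal(+s, 0.01, 200),   # 200 samples under +alpha*
             rng.normal(-s, 0.01, 200))   # 200 samples under -alpha*
    for s in sigmas
])

r = np.corrcoef(sigmas, dmus)[0, 1]  # Pearson correlation, as in Table 1
print(r > 0.0)  # True: positive correlation by construction
```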
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying the +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the
Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.
4.3 Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficients between dataset σ and model ∆µ for linear and nonlinear walks are shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance  Shift X  Shift Y  Zoom
Linear          0.59      0.28     0.39    0.37
Non-linear      0.49      0.49     0.55    0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing
Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations (luminance, red, green, blue) on various datasets using StyleGAN.
color results in different semantics in the image, e.g., turning a dormant volcano to an active one, or a daytime scene to a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements
We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References
[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix
A. Method details
A.1. Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
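As a minimal sketch of this objective, consider a toy setting where the "generator" is a linear map and the edit adds a fixed pixel-space vector; plain SGD on the walk objective then recovers the walk vector that reproduces the edit. All sizes, the linear G, and the edit here are illustrative assumptions, not the BigGAN setup:

```python
import numpy as np

# Toy sketch of J(w) = E_z ||G(z + a*w) - edit(G(z), a)||^2, with a linear
# stand-in generator G(z) = A @ z and edit(x, a) = x + a * delta. SGD then
# recovers a walk vector w satisfying A @ w = delta.
rng = np.random.default_rng(0)
d = 4
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # stand-in "generator"
delta = rng.normal(size=d)                     # pixel-space effect of the edit

w = np.zeros(d)                                # walk vector to learn
lr = 0.01
for _ in range(20000):                         # 20k samples, as in the paper
    z = rng.normal(size=d)
    alpha = rng.uniform(-1.0, 1.0)
    # residual between G(z + alpha*w) and the edited target
    residual = A @ (z + alpha * w) - (A @ z + alpha * delta)
    w -= lr * 2.0 * alpha * (A.T @ residual)   # gradient of the squared loss
# the learned walk now reproduces the target edit: A @ w ~= delta
```

In the real setting G is a deep generator, so the gradient flows through the network; the structure of the loop is the same.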
A.2. Implementation Details for Linear Walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg w) and the target edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between -1 and 1; thus for a random shift by αt pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
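The shifted target and its visibility mask can be sketched as follows; `shift_target` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def shift_target(img, t):
    """Build the target for a horizontal shift by t pixels, plus a mask
    marking the still-visible region. The loss is applied only where the
    mask is True, so the generator must inpaint the rest. (Sketch.)"""
    H, W, C = img.shape
    target = np.zeros_like(img)
    mask = np.zeros((H, W, 1), dtype=bool)
    if t >= 0:  # shift content to the right
        target[:, t:, :] = img[:, :W - t, :]
        mask[:, t:, :] = True
    else:       # shift content to the left
        target[:, :W + t, :] = img[:, -t:, :]
        mask[:, :W + t, :] = True
    return target, mask
```

A vertical shift works the same way on the first axis; the training pair is then the generated image at step αg and this masked target.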
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
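The zoom-in target can be sketched as a center crop followed by a resize back to the original size; nearest-neighbor resampling is an illustrative choice to keep the sketch dependency-free:

```python
import numpy as np

def zoom_target(img, alpha_t):
    """Zoom-in target: crop the central alpha_t fraction of the image and
    resize it back to the original size (0.25 < alpha_t < 1 zooms in).
    Nearest-neighbor resizing keeps the sketch dependency-free."""
    H, W = img.shape[:2]
    ch, cw = max(1, round(H * alpha_t)), max(1, round(W * alpha_t))
    y0, x0 = (H - ch) // 2, (W - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    ys = np.arange(H) * ch // H   # nearest-neighbor row indices
    xs = np.arange(W) * cw // W   # nearest-neighbor column indices
    return crop[np.ix_(ys, xs)]
```

Setting αg = log(αt) makes zoom-in and zoom-out symmetric around zero in the latent walk.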
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg w = z + αR wR + αG wG + αB wB, where we jointly learn the three walk directions wR, wG, and wB.
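The joint parametrization can be written out directly; the dimension and directions below are illustrative stand-ins for the learned quantities:

```python
import numpy as np

# Sketch of the color walk: one learned direction per RGB channel, with
# the latent step given by their alpha-weighted sum. Sizes are illustrative.
rng = np.random.default_rng(1)
dim = 128
wR, wG, wB = rng.normal(size=(3, dim))  # jointly learned walk directions

def color_step(z, alpha):
    """alpha = (aR, aG, aB), each in [-0.5, 0.5] during training."""
    aR, aG, aB = alpha
    return z + aR * wR + aG * wG + aB * wB

z = rng.normal(size=dim)
z_edit = color_step(z, (0.5, 0.0, -0.25))  # brighten red, dim blue
```

Because the three directions are summed linearly, any RGB slider setting maps to a single latent offset.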
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotation; we scale αg = αt/R, where R = 45, and apply a mask during training.
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α).
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z1 = z + αF (z)
z2 = z1 + αF (z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF (z) = z1 + αF (z1) (4)
Therefore
z + 2αF (z) = z + αF (z) + αF (z1)rArrαF (z) = αF (z1)rArrF (z) = F (z1)
Thus F (middot) simply becomes a linear trajectory that is inde-pendent of the input z
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation rather than a variable α step; we then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε dz/dt |_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
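The discrete walk can be sketched as repeated Euler-style steps with renormalization; the two-layer net below is an untrained stand-in with illustrative sizes (its weights would be learned against the edit target):

```python
import numpy as np

# Sketch of the epsilon-step nonlinear walk: z <- z + NN(z), then
# renormalize to the prior's expected norm. The two-layer ReLU net is an
# untrained stand-in; in practice its weights are learned.
rng = np.random.default_rng(2)
dim, hidden = 128, 256
W1 = 0.1 * rng.normal(size=(hidden, dim))
W2 = 0.1 * rng.normal(size=(dim, hidden))

def nn_step(z):
    return W2 @ np.maximum(W1 @ z, 0.0)  # two-layer network NN(z)

def nonlinear_walk(z, n_steps):
    target_norm = np.sqrt(dim)  # expected norm of z ~ N(0, I)
    for _ in range(n_steps):
        z = z + nn_step(z)                         # one fixed epsilon step
        z = z * (target_norm / np.linalg.norm(z))  # renormalize magnitude
    return z

z0 = rng.normal(size=dim)
z4 = nonlinear_walk(z0, 4)  # ~4-5 steps cover the linear walk's range
```

Because the step size is fixed, larger transformations are composed by taking more steps, sidestepping the variable-α degeneracy shown in A.3.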
B. Additional Experiments
B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
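This analysis reduces to measuring distances between images generated at adjacent α values; a sketch follows, with a simple placeholder metric standing in for LPIPS:

```python
import numpy as np

def perceptual_dist(a, b):
    """Placeholder metric; the paper uses LPIPS [25] here."""
    return float(np.mean(np.abs(a - b)))

def consecutive_distances(images_by_alpha):
    """Distances between images generated at adjacent alpha values.
    A saturating transformation shows distances shrinking toward zero
    at large |alpha|; a steady one keeps them roughly constant."""
    return [perceptual_dist(a, b)
            for a, b in zip(images_by_alpha, images_by_alpha[1:])]

# e.g. a transformation that saturates: later steps change nothing
imgs = [np.full((4, 4), v) for v in (0.0, 0.8, 1.0, 1.0, 1.0)]
dists = consecutive_distances(imgs)
```

In the paper's setting, `images_by_alpha` would be the generated outputs G(z + αi w) sorted by αi, and the per-pair distances are averaged over sampled z.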
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4. Alternative Walks in BigGAN
B.4.1. LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z[L(G(z + αw), edit(G(z), α))]  (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to scaling of α).
B.4.2. Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text: we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
2 https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3. Additional Qualitative Examples
We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5. Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6. Qualitative examples for additional transformations
Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target; for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1 (panels: Luminance, Rotate 2D, Shift X, Shift Y, Zoom, Rotate 3D). We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.
Figure 12: Bounding boxes for randomly selected classes using model-generated images, for zoom and horizontal and vertical shift transformations under random values of α.
Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
Figure 15: Quantitative experiments for nonlinear walks in BigGAN (columns: Luminance, Shift X, Shift Y, Zoom; ρ = 0.49, 0.49, 0.55, 0.60, respectively). We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator (columns: Luminance, Shift X, Shift Y, Zoom).
Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator (columns: Luminance, Shift X, Shift Y, Zoom).
Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator (transformations: zoom, color, shift X, shift Y, rotate 2D, rotate 3D).
Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator (columns: Luminance, Shift X, Shift Y, Zoom). For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
we quickly reach limits. The direction in latent space that zooms toward the dog can only make the dog so big; once the dog face fills the full frame, continuing to walk in this direction fails to increase the zoom. A similar effect occurs in the daisy example (row 2 of Fig. 1): we can find a direction in latent space that shifts the daisy up or down, but the daisy gets "stuck" at the edge of the frame, and continuing to walk in this direction does not shift the daisy out of the frame.
We hypothesize that these limits are due to biases in the distribution of images on which the GAN is trained. For example, if dogs and daisies tend to be centered and in frame in the dataset, the same may be the case in the GAN-generated images as well. Nonetheless, we find that some degree of transformation is possible: we can still shift the generated daisies up and down in the frame. When and why can we achieve certain transformations but not others?
This paper seeks to quantify the degree to which we can achieve basic visual transformations by navigating in GAN latent space. In other words, are GANs "steerable" in latent space?¹ We further analyze the relationship between the data distribution on which the model is trained and the success in achieving these transformations. We find that it is possible to shift the distribution of generated images to some degree, but cannot extrapolate entirely out of the support of the training data. In particular, we find that attributes can be shifted in proportion to the variability of that attribute in the training data.
One of the current criticisms of generative models is that they simply interpolate between datapoints and fail to generate anything truly new. Our results add nuance to this story: it is possible to achieve distributional shift, where the model generates realistic images from a different distribution than that on which it was trained. However, the ability to steer the model in this way depends on the data distribution having sufficient diversity along the dimension we are steering.
Our main contributions are:
• We show a simple walk in the z space of GANs can achieve camera motion and color transformations in the output space, in a self-supervised manner, without labeled attributes or distinct source and target images.
• We show this linear walk is as effective as more complex non-linear walks, suggesting that GAN models learn to linearize these operations without being explicitly trained to do so.
• We design measures to quantify the extent and limits of the transformations generative models can achieve, and show experimental evidence. We further show the relation between the shiftability of the model distribution and the variability in datasets.

¹We use the term "steerable" in analogy to the classic steerable filters of Freeman & Adelson [7].
• We demonstrate these transformations as a general-purpose framework on different model architectures, e.g., BigGAN and StyleGAN, illustrating different disentanglement properties in their respective latent spaces.
2. Related work
Walking in the latent space can be seen from different perspectives: how to achieve it, what limits it, and what it enables us to do. Our work addresses these three aspects together, and we briefly refer to each one in prior work.
Interpolations in GAN latent space. Traditional approaches to image editing in latent space involve finding linear directions in GAN latent space that correspond to changes in labeled attributes, such as smile-vectors and gender-vectors for faces [20, 13]. However, smoothly varying latent space interpolations are not exclusive to GANs: in flow-based generative models, linear interpolations between encoded images allow one to edit a source image toward attributes contained in a separate target [16]. With Deep Feature Interpolation [24], one can remove the generative model entirely; instead, the interpolation operation is performed on an intermediate feature space of a pretrained classifier, again using the feature mappings of source and target sets to determine an edit direction. Unlike these approaches, we learn our latent-space trajectories in a self-supervised manner, without relying on labeled attributes or distinct source and target images. We measure the edit distance in pixel space rather than latent space, because our desired target image may not directly exist in the latent space of the generative model. Despite allowing for non-linear trajectories, we find that linear trajectories in the latent space model simple image manipulations, e.g., zoom-vectors and shift-vectors, surprisingly well, even when models were not explicitly trained to exhibit this property in latent space.
Dataset bias. Biases from training data and network architecture both factor into the generalization capacity of learned models [23, 8, 1]. Dataset biases partly come from human preferences in taking photos: we typically capture images in specific "canonical" views that are not fully representative of the entire visual world [19, 12]. When models are trained to fit these datasets, they inherit the biases in the data. Such biases may result in models that misrepresent the given task – such as tendencies towards texture bias rather than shape bias in ImageNet classifiers [8] – which in turn limits their generalization performance on similar objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.
Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob – the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.
Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories, and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory; instead, our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.
3. Method

Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here, z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, where transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).
Figure 2: We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(· · · f(z)))) when we choose a non-linear walk.
Objective. How can we maneuver in the latent space to discover different transformations on a given image? We desire to learn an N-dimensional vector representing the optimal path for a given latent space transformation. We weight this vector by a continuous parameter α, which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function
w* = arg min_w E_{z,α}[ L( G(z + αw), edit(G(z), α) ) ]   (1)
Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner – we perform our desired transformation on an arbitrary generated image, and subsequently we optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote this transformation in the direction of w*, controlled by the variable step size α, defined as model(α) = G(z + αw*).
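The objective in Eq. 1 can be sketched end to end on a toy problem. The sketch below is not the paper's BigGAN setup: `G` is a stand-in linear "generator" mapping a 32-dim latent to an 8-dim "image", and `edit` is a hypothetical brightness edit that adds α to every pixel. It only illustrates how the walk vector w is fit by stochastic gradient descent on the pixel-space L2 loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained generator: a fixed linear map.
A = rng.normal(size=(8, 32)) / 4.0
G = lambda z: A @ z

# Hypothetical target edit: "brightness" adds alpha to every pixel.
def edit(x, alpha):
    return x + alpha * np.ones_like(x)

# Minimize E_{z,alpha} ||G(z + alpha*w) - edit(G(z), alpha)||^2 over w (Eq. 1).
w = np.zeros(32)
lr = 0.05
for _ in range(2000):
    z = rng.normal(size=32)
    alpha = rng.uniform(-1.0, 1.0)
    residual = G(z + alpha * w) - edit(G(z), alpha)
    grad = alpha * A.T @ residual        # gradient of 0.5*||residual||^2 w.r.t. w
    w -= lr * grad

# Walking by alpha*w should now brighten the output by roughly alpha.
z = rng.normal(size=32)
shift = G(z + 0.5 * w) - G(z)
print(np.allclose(shift, 0.5 * np.ones(8), atol=0.05))
```

Because both the generator and the edit are linear here, the walk is recovered almost exactly; with a real GAN, the same loop would backpropagate the loss through the (frozen) generator instead.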
The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories, in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize
L = E_{z,n}[ || G(f^n(z)) − edit(G(z), nε) || ]   (2)
where f^n(·) is an nth-order function composition f(f(· · · f(z))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.
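The recursive application of the ε-step walk can be sketched as below. The tiny residual network standing in for the learned f*(z) is hypothetical (untrained, random weights); the point is only the n-fold composition used to reach an nε-sized transformation.

```python
import numpy as np

def compose(f, z, n):
    """Apply the epsilon-step walk f to z, n times: f(f(...f(z)...))."""
    for _ in range(n):
        z = f(z)
    return z

# Hypothetical epsilon-step walk: a small tanh network with a residual
# connection, standing in for the learned f*(z) (weights are random here).
rng = np.random.default_rng(1)
W1 = rng.normal(size=(32, 16)) * 0.1
W2 = rng.normal(size=(16, 32)) * 0.1
f = lambda z: z + np.tanh(z @ W1) @ W2   # state-dependent step in latent space

z = rng.normal(size=32)
z4 = compose(f, z, 4)   # latent code whose image should match edit(G(z), 4*eps)
print(z4.shape)
```

In training, n would be sampled and the loss of Eq. 2 applied between G(f^n(z)) and the nε-edited target.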
Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation: the learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.
We use this function composition approach rather than the simpler setup of G(z + αNN(z)) because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).
Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., "luminance", in the dataset versus in images generated after walking in latent space.
For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then, we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
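The pixel-sampling procedure above can be sketched as follows. The helper name and the assumption of 8-bit (H, W, 3) arrays are ours; the paper only specifies N = 100 random pixels per image and normalization to [0, 1].

```python
import numpy as np

def color_stats(images, n_pixels=100, seed=0):
    """Estimate channel and luminance distributions by sampling n_pixels
    random pixel locations from each image (pixel values assumed in 0..255)."""
    rng = np.random.default_rng(seed)
    samples = []
    for img in images:                       # img: (H, W, 3)
        h, w, _ = img.shape
        ys = rng.integers(0, h, n_pixels)
        xs = rng.integers(0, w, n_pixels)
        samples.append(img[ys, xs])          # (n_pixels, 3) sampled pixels
    rgb = np.concatenate(samples).astype(float) / 255.0  # normalize to [0, 1]
    luminance = rgb.mean(axis=1)             # mean RGB as the luminance statistic
    return rgb, luminance

imgs = [np.random.default_rng(1).integers(0, 256, size=(64, 64, 3))]
rgb, lum = color_stats(imgs)
print(rgb.shape)  # (100, 3)
```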
For zoom and shift transformations, we rely on an object detector, which identifies the central object in the image. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest-probability bounding box corresponding to the desired class, and use it to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box, normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box, and normalize by the image width or height, respectively.
Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model, and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.
4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.
4.1. What image transformations can we achieve by steering in the latent space?
We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.
We further ask the question: how much are we able to achieve each transformation? When we increase the step size α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and one in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point; in fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold for other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output; and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations of these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.
An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4, we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but has no such effect on the Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue; when we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors; therefore, our color transformation is constrained by the biases exhibited in individual classes of the dataset.
Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation, for 200 samples per class.
Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.
Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic-looking images.
For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.
[Figure 5 plots: for each of Luminance, Shift X, Shift Y, and Zoom, the intersection between G(z) and model(α) and the FID of the transformed output as functions of α (log(α) for zoom), together with PDFs of pixel intensity, object center X, object center Y, and bounding-box area under α values ranging over ±1 (luminance), ±200 pixels (shifts), and 0.0625× to 16× (zoom).]
Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID of the transformed images, to limit our transformations to the natural image manifold.
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift-in-X and shift-in-Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
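The area-of-intersection measure can be sketched with normalized histograms. The uniform samples below are synthetic stand-ins for the center-coordinate statistics, not model outputs, and the bin count is our choice.

```python
import numpy as np

def hist_intersection(a, b, bins=50, value_range=(0.0, 1.0)):
    """Area of intersection between two empirical distributions, estimated
    from density-normalized histograms (1.0 means identical distributions)."""
    pa, _ = np.histogram(a, bins=bins, range=value_range, density=True)
    pb, _ = np.histogram(b, bins=bins, range=value_range, density=True)
    bin_w = (value_range[1] - value_range[0]) / bins
    return np.minimum(pa, pb).sum() * bin_w

rng = np.random.default_rng(0)
base = rng.uniform(0.3, 0.7, 10_000)      # stand-in for G(z) statistics
shifted = base + 0.2                      # stand-in for model(alpha) statistics
print(hist_intersection(base, base))      # identical distributions -> ~1.0
print(hist_intersection(base, shifted))   # partially overlapping -> < 1.0
```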
We perform a similar procedure for the zoom operation, by measuring the area of the bounding box of the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success with which we are able to achieve the zoom-in operation, under the constraint of realistic-looking images, differs from that of zooming out.
These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2. How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, relative to the variability in the dataset, is to measure the difference between the transformed model means model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
[Figure 6 plots: per-class dataset standard deviation (x-axis) vs. model ∆µ (y-axis) for Zoom (r = 0.37, p < 0.001), Shift Y (r = 0.39, p < 0.001), Shift X (r = 0.28, p < 0.001), and Luminance (r = 0.59, p < 0.001), with bounding-box area histograms and zoom strips for the robin and laptop classes.]
Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as
∆µ_k = || µ_{k, model(+α*)} − µ_{k, model(−α*)} ||_2   (3)
for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means µ_{k, model(+α*)} and µ_{k, model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
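The Eq. 3 statistic is just a distance between per-class means. The sketch below uses synthetic per-class samples (the class names and numbers are illustrative, not measured values) to show how ∆µ separates a low-variability class from a high-variability one.

```python
import numpy as np

def delta_mu(stat_pos, stat_neg):
    """Eq. 3: distance between the per-class means of an attribute statistic
    under the +alpha* and -alpha* transformations."""
    return np.linalg.norm(np.mean(stat_pos, axis=0) - np.mean(stat_neg, axis=0))

# Hypothetical per-class samples of a scalar statistic (e.g. box area),
# 200 samples per transformation direction, as in the paper's protocol:
rng = np.random.default_rng(0)
robin_pos = rng.normal(0.22, 0.01, 200)    # narrow class: means barely move
robin_neg = rng.normal(0.20, 0.01, 200)
laptop_pos = rng.normal(0.45, 0.05, 200)   # broad class: means separate widely
laptop_neg = rng.normal(0.15, 0.05, 200)
print(delta_mu(robin_pos, robin_neg) < delta_mu(laptop_pos, laptop_neg))  # True
```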
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under +α* and −α* transformations, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
For the zoom transformation, we show examples of two extremes along the trend. In the "robin" class, the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the
[Figure 7 strips: zoom walks from 0.125× to 8× under Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS objectives.]
Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.
4.3. Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficients between dataset σ and model ∆µ for linear and nonlinear walks are shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance   Shift X   Shift Y   Zoom
Linear          0.59       0.28      0.39     0.37
Non-linear      0.49       0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged: in other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing
[Figure 8 plots: PDFs of luminance under α ∈ [−1, 1], with Luminance, Red, Green, and Blue transformation strips.]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).
5. Conclusion

GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix
A. Method details

A.1. Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
A.2. Implementation Details for Linear Walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
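A minimal sketch of how the shift target and its visibility mask might be built. The use of `np.roll` plus zero-fill is our assumption; the paper specifies only the shift, the mask's purpose, and the αg = αt/D scaling.

```python
import numpy as np

def shift_edit(img, t):
    """Target for the horizontal shift walk: move the image t pixels along X,
    and return a mask of still-visible source pixels so the loss can be
    restricted to them while the generator inpaints the exposed strip."""
    h, w, _ = img.shape
    shifted = np.roll(img, t, axis=1)
    mask = np.ones((h, w), dtype=bool)
    if t > 0:
        shifted[:, :t] = 0   # newly exposed left strip: no source content
        mask[:, :t] = False
    elif t < 0:
        shifted[:, t:] = 0
        mask[:, t:] = False
    return shifted, mask

img = np.arange(4 * 6 * 3, dtype=float).reshape(4, 6, 3)
shifted, mask = shift_edit(img, 2)
alpha_g = 2 / 6   # alpha_g = alpha_t / D maps a 2-pixel shift of a width-6 image
print(mask[:, :2].any(), mask[:, 2:].all())  # False True
```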
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
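The zoom-in target can be sketched as a center crop followed by a resize. The nearest-neighbor resize is a stand-in of ours (the paper does not specify the interpolation), and the zoom-out branch, with its downsampling, padding, and inpainting mask, is omitted for brevity.

```python
import numpy as np

def zoom_edit(img, alpha_t):
    """Zoom-in target: center-crop by alpha_t (0.25 < alpha_t < 1) and resize
    back to the original size via nearest-neighbor indexing. The walk
    coordinate is alpha_g = log(alpha_t), making in/out symmetric about 0."""
    h, w, _ = img.shape
    ch, cw = max(1, int(h * alpha_t)), max(1, int(w * alpha_t))
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = img[y0:y0 + ch, x0:x0 + cw]
    ys = np.arange(h) * ch // h          # nearest-neighbor row indices
    xs = np.arange(w) * cw // w          # nearest-neighbor column indices
    return crop[np.ix_(ys, xs)]

img = np.random.default_rng(0).random((32, 32, 3))
zoomed = zoom_edit(img, 0.5)   # target for a 2x zoom-in
print(zoomed.shape)            # (32, 32, 3)
```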
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg w = z + αR wR + αG wG + αB wB, where we jointly learn the three walk directions wR, wG, and wB.
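This parametrization can be sketched as follows, with random stand-ins for the jointly learned directions wR, wG, wB (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
wR, wG, wB = rng.normal(size=(3, dim))   # stand-ins for learned directions

def color_walk(z, aR, aG, aB):
    # z + aR*wR + aG*wG + aB*wB, with each coefficient in [-0.5, 0.5]
    return z + aR * wR + aG * wG + aB * wB

def color_edit(img, alphas):
    # target edit: add each alpha to its channel, clipped to [0, 1]
    return np.clip(img + np.asarray(alphas), 0.0, 1.0)
```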
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degrees of rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degrees of rotation, scale αg = αt/R where R = 45, and apply a mask during training.
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network, w = NN(z), and transform the image using G(z + αNN(z)). However, as the following argument shows, this parametrization will not learn a w that actually depends on z.
Proof. For simplicity, let w = F(z). We optimize J(w, α) = Ez[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that, for the target image, performing two equal edit operations is equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4× is the same as zooming by 2× twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α).
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z1 = z + αF(z)

z2 = z1 + αF(z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z1 + αF(z1)   (4)
Therefore
z + 2αF(z) = z + αF(z) + αF(z1) ⇒ αF(z) = αF(z1) ⇒ F(z) = F(z1)
Thus, F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate zi+1 = zi + ε dz/dt|zi. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in Eq. 4.
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer, as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
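A minimal sketch of the resulting ε-step walk, with a toy step_fn standing in for the learned network; the renormalization target √dim approximates the expected magnitude of a standard Gaussian latent (our assumption, since the exact normalization is not spelled out above):

```python
import numpy as np

def nonlinear_walk(z, step_fn, n_steps):
    """Apply a fixed eps-step transform n_steps times (Euler-style
    z_{i+1} = z_i + NN(z_i)), renormalizing the magnitude of z after
    each step. step_fn stands in for the learned two-layer network."""
    target_norm = np.sqrt(z.size)      # approx. E||z|| under the Gaussian prior
    for _ in range(n_steps):
        z = z + step_fn(z)             # one eps-step in latent space
        z = z * (target_norm / np.linalg.norm(z))
    return z
```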
B. Additional Experiments

B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we compute the LPIPS perceptual distance between images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect, and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
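The consecutive-pair analysis can be sketched as below, with an L2 distance standing in for the LPIPS metric used above:

```python
import numpy as np

def consecutive_step_distances(batches):
    """Mean pairwise distance between generations at consecutive alpha
    values. batches[i] holds the images generated at alpha_i for the same
    z samples. A flat curve indicates a steady transformation rate, while
    decay toward large |alpha| indicates the transformation is saturating.
    (L2 is a stand-in for LPIPS here.)"""
    return [np.mean([np.linalg.norm(a - b) for a, b in zip(prev, nxt)])
            for prev, nxt in zip(batches, batches[1:])]
```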
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations, for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4. Alternative Walks in BigGAN
B.4.1. LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = Ez[L(G(z + αw), edit(G(z), α))]   (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
B.4.2. Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14, and we perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3. Additional Qualitative Examples
We show qualitative examples for randomly selected categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5. Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6. Qualitative examples for additional transformations
Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining the zoom, shift X, and shift Y transformations (Fig. 26).
Figure 9. Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset.
[Figure 10 panels: LPIPS perceptual distance vs. α (or log(α) for zoom) for the Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, and Zoom transformations.]
Figure 10. LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN, and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11. Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12. Bounding boxes for randomly selected classes, using model-generated images under the zoom, horizontal shift, and vertical shift transformations with random values of α.
Figure 13. Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14. Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15 panels: intersection vs. α, model µ vs. dataset variability scatterplots (p = 0.49, 0.49, 0.55, 0.60), attribute PDFs, and FID vs. α, for the Luminance, Shift X, Shift Y, and Zoom transformations.]
Figure 15. Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z), compared to the distribution under a learned transformation model(α); the intersection area between G(z) and model(α); FID score on transformed images; and scatterplots relating dataset variability to the extent of model transformation.
Figure 16. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18. Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19. Qualitative examples for learned transformations using the StyleGAN car generator.
Figure 20. Quantitative experiments for learned transformations using the StyleGAN car generator.
Figure 21. Qualitative examples for learned transformations using the StyleGAN cat generator.
Figure 22. Quantitative experiments for learned transformations using the StyleGAN cat generator.
Figure 23. Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
Figure 24. Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25. Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26. Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
objectives [2]. Our latent space trajectories transform the output corresponding to various camera motion and image editing operations, but ultimately we are constrained by biases in the data and cannot extrapolate arbitrarily far beyond the dataset.
Deep learning for content creation. The recent progress in generative models has enabled interesting applications for content creation [4, 13], including variants that enable end users to control and fine-tune the generated output [22, 26, 3]. A by-product of the current work is to further enable users to modify various image properties by turning a single knob – the magnitude of the learned transformation. Similar to [26], we show that GANs allow users to achieve basic image editing operations by manipulating the latent space. However, we further demonstrate that these image manipulations are not just a simple creativity tool; they also provide us with a window into the biases and generalization capacity of these models.
Concurrent work. We note a few concurrent papers that also explore trajectories in GAN latent space. [6] learns linear walks in the latent space that correspond to various facial characteristics; they use these walks to measure biases in facial attribute detectors, whereas we study biases in the generative model that originate from training data. [21] also assumes linear latent space trajectories, and learns paths for face attribute editing according to semantic concepts such as age and expression, thus demonstrating disentanglement properties of the latent space. [9] applies a linear walk to achieve transformations in learning and editing features that pertain to cognitive properties of an image, such as memorability, aesthetics, and emotional valence. Unlike these works, we do not require an attribute detector or assessor function to learn the latent space trajectory, and therefore our loss function is based on image similarity between source and target images. In addition to linear walks, we explore using non-linear walks to achieve camera motion and color transformations.
3. Method
Generative models such as GANs [10] learn a mapping function G such that G : z → x. Here, z is the latent code drawn from a Gaussian density, and x is an output, e.g., an image. Our goal is to achieve transformations in the output space by walking in the latent space, as shown in Fig. 2. In general, this goal also captures the idea of equivariance, where transformations in the input space result in equivalent transformations in the output space (e.g., refer to [11, 5, 17]).
Figure 2. We aim to find a path in z space to transform the generated image G(z) to its edited version edit(G(z), α), e.g., an α× zoom. This walk results in the generated image G(z + αw) when we choose a linear walk, or G(f(f(· · · f(z)))) when we choose a non-linear walk.
Objective. How can we maneuver in the latent space to discover different transformations on a given image? We wish to learn an N-dimensional vector representing the optimal path for this latent space transformation. We weight this vector by a continuous parameter α, which signifies the step size of the transformation: large α values correspond to a greater degree of transformation, while small α values correspond to a lesser degree. Formally, we learn the walk w by minimizing the objective function
w* = arg minw Ez,α[L(G(z + αw), edit(G(z), α))]   (1)
Here, L measures the distance between the generated image after taking an α-step in the latent direction, G(z + αw), and the target edit(G(z), α) derived from the source image G(z). We use the L2 loss as our objective L; however, we also obtain similar results when using the LPIPS perceptual image similarity metric [25] (see B.4.1). Note that we can learn this walk in a fully self-supervised manner – we perform our desired transformation on an arbitrary generated image, and subsequently optimize our walk to satisfy the objective. We learn the optimal walk w* to model the transformation applied in edit(·). Let model(α) denote this transformation in the direction of w*, controlled with the variable step size α, defined as model(α) = G(z + αw*).
The previous setup assumes linear latent space walks, but we can also optimize for non-linear trajectories, in which the walk direction depends on the current latent space position. For the non-linear walk, we learn a function f*(z) which corresponds to a small ε-step transformation edit(G(z), ε). To achieve bigger transformations, we learn to apply f recursively, mimicking discrete Euler ODE approximations. Formally, for a fixed ε, we minimize
L = Ez,n[||G(f^n(z)) − edit(G(z), nε)||],   (2)
where f^n(·) is the n-th order function composition f(f(f(...))), and f(z) is parametrized with a neural network. We discuss further implementation details in A.4.
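The n-th order composition in Eq. 2 is plain repeated application of f; a minimal helper:

```python
def compose_n(f, n):
    """Return f^n, the n-fold composition f(f(...f(z))) used in Eq. 2."""
    def fn(z):
        for _ in range(n):
            z = f(z)
        return z
    return fn
```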
Figure 3. Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation. The learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.
We use this function composition approach rather than the simpler setup of G(z + αNN(z)), because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).
Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g., "luminance", in the dataset versus in images generated after walking in latent space.
For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then, we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
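A sketch of this pixel-sampling step (assuming images already normalized to [0, 1]; the function name is ours):

```python
import numpy as np

def sample_pixel_stats(img, n=100, seed=0):
    """Randomly sample n pixels from an HxWx3 image in [0, 1]; return the
    per-channel values and the per-pixel luminance (mean over RGB)."""
    rng = np.random.default_rng(seed)
    h, w, _ = img.shape
    ys = rng.integers(0, h, size=n)
    xs = rng.integers(0, w, size=n)
    px = img[ys, xs].astype(float)     # (n, 3) sampled channel values
    return px, px.mean(axis=1)         # channels, luminance
```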
For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest-probability bounding box corresponding to the desired class, and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shift in the X and Y directions, we take the center X and Y coordinates of the bounding box, and normalize by image width or height, respectively.
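These bounding-box statistics reduce to a few normalizations; a minimal sketch (the (x0, y0, x1, y1) box format is our assumption):

```python
def bbox_stats(box, img_w, img_h):
    """Normalized statistics from a detected box (x0, y0, x1, y1):
    the area fraction quantifies zoom; the normalized center
    coordinates quantify shift in X and Y."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0) / float(img_w * img_h)
    cx = 0.5 * (x0 + x1) / img_w
    cy = 0.5 * (y0 + y1) / img_h
    return area, cx, cy
```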
Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model, and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.
4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.
4.1. What image transformations can we achieve by steering in the latent space?
We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.
We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3, we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point. In fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output; and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because, in ImageNet photos, the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.
An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4, we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited by individual classes in the dataset.
Figure 4. Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation, for 200 samples per class.
Motivated by our qualitative observations in Figs. 1, 3, and 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.
Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic-looking images.
For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.
[Figure 5 plots: intersection between G(z) and model(α), and FID, as functions of α; attribute PDFs (pixel intensity, center X, center Y, area) under varying α, for the Luminance, Shift X, Shift Y, and Zoom transformations.]
Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
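The intersection measure can be sketched as a normalized-histogram overlap; everything below (bin count, toy center distributions) is illustrative rather than the paper's exact implementation:

```python
import numpy as np

def hist_intersection(x, y, bins=50, lo=0.0, hi=1.0):
    """Area of overlap between two empirical distributions, estimated
    from normalized histograms (1.0 means identical histograms)."""
    px, _ = np.histogram(x, bins=bins, range=(lo, hi))
    py, _ = np.histogram(y, bins=bins, range=(lo, hi))
    return float(np.minimum(px / px.sum(), py / py.sum()).sum())

rng = np.random.default_rng(0)
# toy object-center distributions: the raw model concentrates at 0.5,
# a shift transformation moves the mode toward 0.7
centers_model = np.clip(rng.normal(0.5, 0.1, 10_000), 0.0, 1.0)
centers_shift = np.clip(rng.normal(0.7, 0.1, 10_000), 0.0, 1.0)
same = hist_intersection(centers_model, centers_model)   # identical: 1.0
moved = hist_intersection(centers_model, centers_shift)  # shifted: < 1.0
```

As |α| grows, `moved` shrinks toward zero, mirroring the intersection curves in Fig. 5.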
We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic-looking images differs from that of zooming out.
These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2 How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α*) and model(−α*), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
[Figure 6 scatterplots: dataset standard deviation vs. model Δμ for Zoom (r = 0.37), Shift Y (r = 0.39), Shift X (r = 0.28), and Luminance (r = 0.59), with p < 0.001 for each; zoom examples shown for the "robin" and "laptop" classes.]
Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.
More concretely, we define the change in model means under a given transformation as

Δμ_k = ‖μ_{k,model(+α*)} − μ_{k,model(−α*)}‖_2 ,   (3)
for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means of μ_{k,model(+α*)} and μ_{k,model(−α*)} under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
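A minimal sketch of Eq. 3, computing Δμ for one class from per-sample attribute statistics (the values here are synthetic stand-ins for detected bounding-box areas, not real measurements):

```python
import numpy as np

def delta_mu(stat_pos, stat_neg):
    """Eq. 3: distance between the class means of an attribute statistic
    measured under the +alpha* and -alpha* transformations."""
    return float(np.linalg.norm(np.mean(stat_pos, axis=0) - np.mean(stat_neg, axis=0)))

rng = np.random.default_rng(0)
# synthetic bounding-box areas for one class, 200 samples per direction
area_pos = rng.normal(0.45, 0.05, 200)   # under model(+alpha*)
area_neg = rng.normal(0.15, 0.05, 200)   # under model(-alpha*)
dmu = delta_mu(area_pos, area_neg)       # roughly 0.3 for this toy class
```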
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model Δμ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of Δμ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
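The reported correlations are plain Pearson coefficients between per-class dataset spread and model shift. A toy reconstruction of this analysis, with synthetic per-class values standing in for the 100 sampled ImageNet classes:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-class statistics: dataset spread sigma_k (x-axis of Fig. 6)
# and model shift dmu_k with a noisy positive trend (y-axis)
sigma = rng.uniform(0.05, 0.3, 100)
dmu = 1.2 * sigma + rng.normal(0.0, 0.03, 100)

r = float(np.corrcoef(sigma, dmu)[0, 1])   # Pearson's correlation coefficient
```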
For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation Δμ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad; ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the
Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.
4.3 Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model Δμ for linear and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
            Luminance  Shift X  Shift Y  Zoom
Linear        0.59      0.28     0.39    0.37
Non-linear    0.49      0.49     0.55    0.60

Table 1: Pearson's correlation coefficient between dataset σ and model Δμ for measured attributes; p-value for slope < 0.001 for all transformations.
To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing
Figure 8: Distribution for luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).
5 Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I., and a U.S. National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure. 2019.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix

A Method details

A.1 Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
A.2 Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g w) and the target edit edit(G(z), α_t). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
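A dependency-free sketch of the shift edit and its mask under the convention described above; `np.roll` plus zeroing the wrapped region stands in for the actual image pipeline:

```python
import numpy as np

def shift_x_edit(img, t):
    """Target edit for a horizontal shift by t pixels (t may be negative).
    Returns the shifted image plus a mask of in-frame pixels; the loss is
    applied only where the mask is True, so the generator must inpaint the rest."""
    h, w = img.shape[:2]
    shifted = np.roll(img, t, axis=1)   # np.roll returns a copy, safe to mutate
    mask = np.zeros((h, w), dtype=bool)
    if t >= 0:
        shifted[:, :t] = 0.0            # zero the wrapped-in region
        mask[:, t:] = True
    else:
        shifted[:, t:] = 0.0
        mask[:, :t] = True
    return shifted, mask

img = np.arange(16.0).reshape(4, 4)
target, mask = shift_x_edit(img, 1)
alpha_g = 1 / 4   # alpha_g = alpha_t / D, with D = 4 the toy image width
```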
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
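A sketch of the zoom-in target edit (crop the central α_t fraction and resize back); nearest-neighbor resizing is used here only to keep the example dependency-free, and the zoom-out case would analogously downsample and mask the padded border:

```python
import numpy as np

def zoom_in_edit(img, alpha_t):
    """Zoom-in target: crop the central alpha_t fraction (0.25 < alpha_t < 1)
    of the image and resize it back to the original size."""
    h, w = img.shape[:2]
    ch, cw = int(h * alpha_t), int(w * alpha_t)
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = (np.arange(h) * ch / h).astype(int)   # nearest-neighbor indices
    cols = (np.arange(w) * cw / w).astype(int)
    return crop[rows][:, cols]

img = np.arange(64.0).reshape(8, 8)
target = zoom_in_edit(img, 0.5)     # 2x zoom-in target
alpha_g = np.log(0.5)               # log maps the zoom factor to an additive step
```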
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values between [−0.5, 0.5] in training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g w = z + α_R w_R + α_G w_G + α_B w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.
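The color walk parameterization and its target edit can be sketched directly; the latent dimensionality and the directions below are random stand-ins for the jointly learned w_R, w_G, w_B:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128                                  # illustrative latent dimensionality
z = rng.normal(size=dim)
w_r, w_g, w_b = rng.normal(size=(3, dim))  # stand-ins for the learned directions

def color_step(z, alpha):
    """Walk z + alpha_R*w_R + alpha_G*w_G + alpha_B*w_B for an RGB
    slider alpha = (alpha_R, alpha_G, alpha_B), each in [-0.5, 0.5]."""
    a_r, a_g, a_b = alpha
    return z + a_r * w_r + a_g * w_g + a_b * w_b

def color_edit(img, alpha):
    """Target edit: add each slider value to the corresponding channel
    of an (H, W, 3) image."""
    return img + np.asarray(alpha)

z_red = color_step(z, (0.5, 0.0, 0.0))
red_img = color_edit(np.zeros((2, 2, 3)), (0.5, 0.0, 0.0))
```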
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degree rotation. Using R = 45, we scale α_g = α_t/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degree rotation, we scale α_g = α_t/R where R = 45, and apply a mask during training.
A.3 Linear NN(z) walk

Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

edit(G(z), 2α) = edit(edit(G(z), α), α).

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

z_1 = z + αF(z)
z_2 = z_1 + αF(z_1)

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

z + 2αF(z) = z_1 + αF(z_1)   (4)

Therefore:

z + 2αF(z) = z + αF(z) + αF(z_1) ⇒ αF(z) = αF(z_1) ⇒ F(z) = F(z_1)

Thus F(·) simply becomes a linear trajectory that is independent of the input z.
A.4 Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε (dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
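A sketch of the renormalized Euler-style update, with random weights standing in for the trained ε-step network (the ε scale is folded into NN here, and a second network with the same architecture would handle the negative direction):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 128, 64
# random weights standing in for the trained two-layer epsilon-step network
W1, b1 = rng.normal(0.0, 0.1, (hidden, dim)), np.zeros(hidden)
W2, b2 = rng.normal(0.0, 0.1, (dim, hidden)), np.zeros(dim)

def nn(z):
    """Two-layer network NN(z); the fixed epsilon scale is folded in here."""
    return W2 @ np.maximum(W1 @ z + b1, 0.0) + b2

def euler_step(z):
    """One discrete update z <- z + NN(z), renormalized to preserve the
    latent magnitude, mimicking an Euler step for eps*dz/dt = NN(z)."""
    z_next = z + nn(z)
    return z_next * (np.linalg.norm(z) / np.linalg.norm(z_next))

z0 = rng.normal(size=dim)
z = z0
for _ in range(4):   # 4-5 steps span the full transformation range
    z = euler_step(z)
```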
B Additional Experiments

B.1 Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2 Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect, and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
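The consecutive-α analysis can be illustrated with a toy saturating transformation; plain L2 stands in for LPIPS to keep the sketch dependency-free, and the tanh transformation is a hypothetical stand-in for the plateau effect:

```python
import numpy as np

def l2_distance(a, b):
    """Stand-in for LPIPS: root-mean-square pixel difference."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def toy_transform(img, alpha):
    """Hypothetical transformation whose effect saturates for large |alpha|."""
    return np.clip(img + np.tanh(alpha), 0.0, 1.0)

img = np.full((8, 8), 0.5)
alphas = np.linspace(-3.0, 3.0, 13)
# perceptual distance between outputs at consecutive alpha_i and alpha_{i+1}
d = [l2_distance(toy_transform(img, a0), toy_transform(img, a1))
     for a0, a1 in zip(alphas[:-1], alphas[1:])]
# distances are largest near alpha = 0 and shrink toward the extremes
```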
B.3 Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4 Alternative Walks in BigGAN

B.4.1 LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z[L(G(z + αw), edit(G(z), α))],   (5)

using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, 0.0001 for the remaining edit operations (due to scaling of α).
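On a toy linear "generator" G(z) = Az with edit(G(z), α) = G(z) + αv, minimizing Eq. 5 with the L2 loss by stochastic gradient descent recovers the least-squares walk vector; this is an illustrative reduction of the objective, not the BigGAN training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_x = 16, 32
A = rng.normal(0.0, 0.3, (dim_x, dim_z))   # toy linear "generator": G(z) = A z
v = rng.normal(size=dim_x)                 # target edit direction in image space

w = np.zeros(dim_z)                        # the walk vector being learned
lr = 0.05
for _ in range(5000):
    z = rng.normal(size=dim_z)
    alpha = rng.uniform(-1.0, 1.0)
    # residual of Eq. 5 with L2 loss: G(z + alpha*w) - edit(G(z), alpha)
    residual = A @ (z + alpha * w) - (A @ z + alpha * v)
    w -= lr * alpha * (A.T @ residual)     # gradient of 0.5*||residual||^2 w.r.t. w
```

In this reduction the residual simplifies to α(Aw − v), so gradient descent drives w toward the least-squares solution of Aw ≈ v.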
B.4.2 Non-linear Walks

Following Sec. A.4, we modify our objective to use discrete step sizes ε, rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk, in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, 18, respectively.
B.5 Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, 23, and corresponding quantitative analyses in Figs. 20, 22, 24.
B.6 Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset.
Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN, and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12: Bounding boxes for randomly selected classes, using model-generated images under the zoom, horizontal shift, and vertical shift transformations with random values of α.
Figure 13: Linear walks in BigGAN, trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14: Nonlinear walks in BigGAN, trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
[Image grids omitted: Zoom, Shift X, Shift Y, Rotate 2D, Rotate 3D, and Color transformations.]
Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.
[Figure 22 plots omitted: data vs. model PDFs, intersection areas, and FID curves under varying α for Luminance, Shift X, Shift Y, and Zoom.]
Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.
[Image grids omitted: Zoom, Shift X, Shift Y, Rotate 2D, Rotate 3D, and Color transformations.]
Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
[Figure 24 plots omitted: data vs. model PDFs, intersection areas, and FID curves under varying α for Luminance, Shift X, Shift Y, and Zoom.]
Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
[Image grids omitted: Color and Contrast transformations.]
Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
[Image strips omitted: Rotate 2D, Shift Y, and Zoom sequences with per-segment LPIPS distances.]
Figure 3: Transformation limits. As we increase the magnitude of the learned walk vector, we reach a limit to how much we can achieve a given transformation: the learned operation either ceases to transform the image any further, or the image starts to deviate from the natural image manifold. Below each figure we also indicate the average LPIPS perceptual distance between 200 sampled image pairs of that category. Perceptual distance decreases as we move farther from the source (center image), which indicates that the images are converging.
We use this function composition approach rather than the simpler setup of G(z + αNN(z)) because the latter learns to ignore the input z when α takes on continuous values, and is thus equivalent to the previous linear trajectory (see A.3 for further details).
Quantifying Steerability. We further seek to quantify the extent to which we are able to achieve the desired image manipulations with respect to each transformation. To this end, we compare the distribution of a given attribute, e.g. "luminance", in the dataset versus in images generated after walking in latent space.
For color transformations, we consider the effect of increasing or decreasing the α coefficient corresponding to each color channel. To estimate the color distribution of model-generated images, we randomly sample N = 100 pixels per image, both before and after taking a step in latent space. Then we compute the pixel value for each channel, or the mean RGB value for luminance, and normalize the range between 0 and 1.
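As a concrete sketch of this sampling procedure (our own illustration, not the authors' code; the function name and array layout are assumptions), the per-channel and luminance statistics could be estimated as:

```python
import numpy as np

def sample_color_stats(images, n_pixels=100, seed=0):
    """Estimate the color distribution of a batch of images by randomly
    sampling n_pixels per image, as described in the text.

    images: float array of shape (B, H, W, 3), values assumed in [0, 255].
    Returns per-channel pixel samples and their luminance (mean over RGB),
    both normalized to [0, 1].
    """
    rng = np.random.default_rng(seed)
    b, h, w, _ = images.shape
    ys = rng.integers(0, h, size=(b, n_pixels))
    xs = rng.integers(0, w, size=(b, n_pixels))
    # Advanced indexing gathers the sampled pixels: shape (B, n_pixels, 3)
    pixels = images[np.arange(b)[:, None], ys, xs] / 255.0
    luminance = pixels.mean(axis=-1)  # mean RGB value per sampled pixel
    return pixels, luminance
```

The same routine would be run on images before and after the latent step, and the two sets of samples histogrammed to compare distributions.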
For zoom and shift transformations, we rely on an object detector which captures the central object in the image class. We use a MobileNet-SSD v1 [18] detector to estimate object bounding boxes, and restrict ourselves to image classes recognizable by the detector. For each successful detection, we take the highest-probability bounding box corresponding to the desired class and use that to quantify the degree to which we are able to achieve each transformation. For the zoom operation, we use the area of the bounding box normalized by the area of the total image. For shifts in the X and Y directions, we take the center X and Y coordinates of the bounding box and normalize by image width or height, respectively.
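The detector itself is external; the normalization of its output into the three statistics used above can be sketched as follows (our illustration; the box convention is an assumption):

```python
def box_stats(box, img_w, img_h):
    """Convert a detected bounding box (x0, y0, x1, y1), in pixels, into the
    normalized statistics described in the text:
      zoom  -> box area / image area
      shift -> box center x / image width, box center y / image height.
    """
    x0, y0, x1, y1 = box
    area = ((x1 - x0) * (y1 - y0)) / (img_w * img_h)  # normalized area
    cx = (x0 + x1) / 2.0 / img_w                       # normalized center x
    cy = (y0 + y1) / 2.0 / img_h                       # normalized center y
    return area, cx, cy
```

For example, a 64x64 box in the top-left corner of a 128x128 image yields area 0.25 and center (0.25, 0.25).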
Truncation parameters in GANs (as used in [4, 13]) trade off between the variety of the generated images and sample quality. When comparing generated images to the dataset distribution, we use the largest possible truncation for the model, and perform similar cropping and resizing of the dataset as done during model training (see [4]). When comparing the attributes of generated distributions under different α magnitudes to each other, but not to the dataset, we reduce truncation to 0.5 to ensure better performance of the object detector.
4. Experiments

We demonstrate our approach using BigGAN [4], a class-conditional GAN trained on 1000 ImageNet categories. We learn a shared latent space walk by averaging across the image categories, and further quantify how this walk affects each class differently. We focus on linear walks in latent space for the main text, and show additional results on nonlinear walks in Sec. 4.3 and B.4.2. We also conduct experiments on StyleGAN [13], which uses an unconditional style-based generator architecture, in Sec. 4.3 and B.5.
4.1. What image transformations can we achieve by steering in the latent space?
We show qualitative results of the learned transformations in Fig. 1. By walking in the generator latent space, we are able to learn a variety of transformations on a given source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.
We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point; in fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because, in ImageNet photos, the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.
An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4 we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch: the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars blue; when we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors; therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.
[Image rows and boxplots omitted: Redness (jellyfish/goldfinch), Darkness (volcano/alp), and Blueness (sports car/fire engine), with LPIPS perceptual-distance boxplots per class pair.]
Figure 4: Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class.
Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.
Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic-looking images.
For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.
[Figure 5 plots omitted: intersection, FID, and PDF curves under varying α for Pixel Intensity (Luminance), Center X (Shift X), Center Y (Shift Y), and Area (Zoom).]
Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent: if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic-looking images differs from that of zooming out.
These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2. How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
[Figure 6 plots omitted: per-class scatterplots of dataset σ vs. model Δµ (Luminance r = 0.59, Shift X r = 0.28, Shift Y r = 0.39, Zoom r = 0.37, all p < 0.001), with zoom example strips for the "robin" and "laptop" classes.]
Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.
More concretely, we define the change in model means under a given transformation as

$$\Delta\mu_k = \left\lVert \mu_{k,\mathrm{model}}(+\alpha^*) - \mu_{k,\mathrm{model}}(-\alpha^*) \right\rVert_2 \tag{3}$$
for a given class k, under the optimal walk w* for each transformation under Eq. 5, where we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes: one that is large enough to separate the means of µ_k,model(α*) and µ_k,model(−α*) under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
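Given per-class samples of an attribute statistic under the +α* and −α* walks, Eq. 3 and the Fig. 6 correlation can be sketched in a few lines (our illustration; function names are ours):

```python
import numpy as np

def delta_mu(stat_pos, stat_neg):
    """Eq. (3): L2 distance between the mean attribute statistic measured
    under the +alpha* and -alpha* transformed model distributions.
    stat_pos, stat_neg: arrays of per-sample statistics for one class."""
    return np.linalg.norm(np.mean(stat_pos, axis=0) - np.mean(stat_neg, axis=0))

def class_bias_correlation(dataset_sigma, model_delta_mu):
    """Pearson correlation between per-class dataset spread (sigma) and the
    per-class mean shift delta-mu achieved by the walk, as plotted in Fig. 6."""
    return np.corrcoef(dataset_sigma, model_delta_mu)[0, 1]
```

With one scalar statistic per sample (e.g., normalized bounding-box area for zoom), `delta_mu` reduces to the absolute difference of the two means.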
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model Δµ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the
statistics for each transformation as described in Sec. 4.1. Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of Δµ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
For the zoom transformation, we show examples of two extremes along the trend. In the "robin" class, the spread σ in the dataset is low, and subsequently the separation Δµ that we are able to achieve by applying the +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the
[Image strips omitted: zoom sequences from 0.125x to 8x for linear and nonlinear walks under the L2 and LPIPS objectives.]
Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.
4.3. Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model Δµ for linear and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
            Luminance  Shift X  Shift Y  Zoom
Linear        0.59      0.28     0.39    0.37
Non-linear    0.49      0.49     0.55    0.60

Table 1: Pearson's correlation coefficient between dataset σ and model Δµ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing
[Figure 8 plots and image grids omitted: luminance PDFs under varying α, and Luminance, Red, Green, and Blue color edits.]
Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model exhibits characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891-906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44-51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755-1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236-10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991-999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719-727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064-7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
Appendix

A. Method details

A.1. Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
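To make the objective concrete, here is a toy, self-contained version of the walk optimization (our construction, not the authors' code): a linear "generator" G(z) = A·z and an additive target edit stand in for BigGAN, since backpropagating through a real generator requires a deep-learning framework. The loop minimizes E_{z,α}‖G(z + α·w) − edit(G(z), α)‖² over the walk vector w.

```python
import numpy as np

def train_linear_walk(A, t, dim_z, steps=800, lr=0.05, seed=0):
    """Toy walk optimization: G(z) = A @ z, edit(x, alpha) = x + alpha * t.
    Minimize E_{z,alpha} ||G(z + alpha*w) - edit(G(z), alpha)||^2 by SGD on w.
    A: (d, dim_z) generator matrix; t: (d,) target edit direction."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim_z)
    for _ in range(steps):
        alpha = rng.uniform(-1.0, 1.0)
        # For a linear G, the residual is alpha * (A @ w - t), independent
        # of the sampled z, so z drops out of the gradient below.
        residual = alpha * (A @ w - t)
        grad_w = 2.0 * alpha * (A.T @ residual)  # gradient of ||residual||^2
        w -= lr * grad_w
    return w
```

In this toy setting the optimum satisfies A·w = t, i.e., the walk moves the generator output along the edit direction for every step size α.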
A.2. Implementation Details for Linear Walk

We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
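The shifted target and its visibility mask can be sketched as follows (our illustration of the edit described above; the function name and array layout are assumptions):

```python
import numpy as np

def shift_target(img, shift_px):
    """Translate img by shift_px pixels along x (positive = right) and
    return (target, mask). The mask marks pixels still visible after the
    shift, so the training loss is applied only there and the generator
    must extrapolate the obscured region.
    img: (H, W, C) array."""
    h, w, c = img.shape
    target = np.zeros_like(img)
    mask = np.zeros((h, w), dtype=bool)
    if shift_px >= 0:
        target[:, shift_px:] = img[:, : w - shift_px]
        mask[:, shift_px:] = True
    else:
        target[:, :shift_px] = img[:, -shift_px:]
        mask[:, :shift_px] = True
    # Corresponding walk coefficient, scaled by image width D:
    alpha_g = shift_px / w
    return target, mask, alpha_g
```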
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
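The zoom target can be sketched with nearest-neighbour resampling about the image centre (our assumption; any crop-and-resize scheme would do): αt < 1 enlarges the central crop, αt > 1 shrinks the image, and the mask marks output pixels whose source coordinate falls inside the frame.

```python
import numpy as np

def zoom_edit(img, alpha_t):
    h, w, c = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    src_y = cy + (np.arange(h) - cy) * alpha_t   # source row per output row
    src_x = cx + (np.arange(w) - cx) * alpha_t
    in_y = (src_y >= 0) & (src_y <= h - 1)       # rows with in-frame sources
    in_x = (src_x >= 0) & (src_x <= w - 1)
    ys = np.clip(np.round(src_y), 0, h - 1).astype(int)
    xs = np.clip(np.round(src_x), 0, w - 1).astype(int)
    out = img[np.ix_(ys, xs)]
    mask = (in_y[:, None] & in_x[None, :])[..., None].astype(img.dtype)
    return out, mask

def zoom_alpha_g(alpha_t):
    # log mapping: zoom-in (alpha_t < 1) maps to negative walk steps,
    # zoom-out (alpha_t > 1) to positive ones, and alpha_t = 1 to zero.
    return np.log(alpha_t)
```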
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values between [-0.5, 0.5] in training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
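In code, the slider and the jointly learned per-channel directions look like the following sketch (shapes and names are illustrative assumptions):

```python
import numpy as np

def color_walk(z, alphas, wR, wG, wB):
    # z + aR*wR + aG*wG + aB*wB: one learned direction per channel slider
    aR, aG, aB = alphas
    return z + aR * wR + aG * wG + aB * wB

def color_edit(img, alphas):
    # Target edit: add each slider value to its HxWx3 image channel
    return img + np.asarray(alphas)[None, None, :]
```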
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with -45 ≤ αt ≤ 45 degree rotations. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with -45 ≤ αt ≤ 45 degree rotations; we scale αg = αt/R, where R = 45, and apply a mask during training.
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, performing two equal edit operations is equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α)
To achieve this target, starting from an initial z, we can take two steps of size α in latent space, as follows:
z1 = z + αF (z)
z2 = z1 + αF (z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF (z) = z1 + αF (z1) (4)
Therefore
z + 2αF (z) = z + αF (z) + αF (z1)rArrαF (z) = αF (z1)rArrF (z) = F (z1)
Thus F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·(dz/dt) = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
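A sketch of this fixed-step walk follows. The two-layer network weights here are untrained random placeholders, negating a single network stands in for the paper's separate per-direction networks, and the ε scale is folded into NN; only the step-and-renormalize structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((32, 8)) * 0.1          # placeholder weights
W2 = rng.standard_normal((8, 32)) * 0.1

def nn(z):
    # two-layer network producing one epsilon-sized step in latent space
    return W2 @ np.maximum(W1 @ z, 0.0)

def euler_step(z, direction=1):
    z_next = z + direction * nn(z)
    # renormalize so the latent magnitude is preserved after each step
    return z_next * (np.linalg.norm(z) / np.linalg.norm(z_next))

def walk(z, n_steps, direction=1):
    for _ in range(n_steps):
        z = euler_step(z, direction)
    return z
```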
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
B. Additional Experiments

B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
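The consecutive-pair analysis can be sketched as follows, with a plain RMS pixel distance standing in for the perceptual metric (an assumption; the paper uses the LPIPS network [25]):

```python
import numpy as np

def consecutive_distances(images):
    """images: list of arrays generated at increasing alpha values.
    Returns the distance between each consecutive pair of images."""
    return [float(np.sqrt(((a - b) ** 2).mean()))
            for a, b in zip(images[:-1], images[1:])]
```

A flat distance profile (as for color and 2D/3D rotation) indicates a steady transformation rate, while a profile that decays with the magnitude of α (as for shift and zoom) indicates the transformation is saturating.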
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4. Alternative Walks in BigGAN

B.4.1. LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z[L(G(z + αw), edit(G(z), α))] (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, 0.0001 for the remaining edit operations (due to scaling of α).
B.4.2. Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3. Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18 respectively.
B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
Figure 9. Comparing model versus dataset distributions. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom, each showing data and model PDFs.]
Figure 10. LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN, and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude. [Plots omitted; panels: Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, showing perceptual distance vs. α (or log(α) for zoom).]
Figure 11. Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12. Bounding boxes for randomly selected classes, using model-generated images under the zoom, horizontal shift, and vertical shift transformations with random values of α.
Figure 13. Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14. Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
Figure 15. Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), FID scores on transformed images, and scatterplots relating dataset variability to the extent of model transformation. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom, with correlations ρ = 0.49, 0.49, 0.55, 0.60 respectively.]
Figure 16. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 17. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 18. Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 19. Qualitative examples for learned transformations using the StyleGAN car generator. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 20. Quantitative experiments for learned transformations using the StyleGAN car generator. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom, showing data vs. model PDFs, intersection area, FID, and transformed distributions across α.]
Figure 21. Qualitative examples for learned transformations using the StyleGAN cat generator. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 22. Quantitative experiments for learned transformations using the StyleGAN cat generator. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom, showing data vs. model PDFs, intersection area, FID, and transformed distributions across α.]
Figure 23. Qualitative examples for learned transformations using the StyleGAN FFHQ face generator. [Images omitted; panels show − to + walks for Zoom, Color, Shift X, Shift Y, Rotate 2D, and Rotate 3D.]
Figure 24. Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom.]
Figure 25. Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column). [Images omitted.]
Figure 26. Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
source image (shown in the center image of each transformation). Interestingly, several priors come into play when learning these image transformations. When we shift a daisy downwards in the Y direction, the model hallucinates that the sky exists at the top of the image. However, when we shift the daisy up, the model inpaints the remainder of the image with grass. When we alter the brightness of an image, the model transitions between nighttime and daytime. This suggests that the model is able to extrapolate from the original source image, and yet remain consistent with the image context.
We further ask the question: how much are we able to achieve each transformation? When we increase the step size of α, we observe that the extent to which we can achieve each transformation is limited. In Fig. 3 we observe two potential failure cases: one in which the image becomes unrealistic, and the other in which the image fails to transform any further. When we try to zoom in on a Persian cat, we observe that the cat no longer increases in size beyond some point. In fact, the size of the cat under the latent space transformation consistently undershoots the target zoom. On the other hand, when we try to zoom out on the cat, we observe that it begins to fall off the image manifold, and at some point is unable to become any smaller. Indeed, the perceptual distance (using LPIPS) between images decreases as we push α towards the transformation limits. Similar trends hold with other transformations: we are able to shift a lorikeet up and down to some degree, until the transformation fails to have any further effect and yields unrealistic output, and despite adjusting α on the rotation vector, we are unable to rotate a pizza. Are the limitations to these transformations governed by the training dataset? In other words, are our latent space walks limited because in ImageNet photos the cats are mostly centered and taken within a certain size? We seek to investigate and quantify these biases in the next sections.
An intriguing property of the learned trajectory is that the degree to which it affects the output depends on the image class. In Fig. 4, we investigate the impact of the walk for different image categories under color transformations. By stepping w in the direction of increasing redness in the image, we are able to successfully recolor a jellyfish, but we are unable to change the color of a goldfinch – the goldfinch remains yellow, and the only effect of w is to slightly redden the background. Likewise, increasing the brightness of a scene transforms an erupting volcano into a dormant volcano, but does not have such an effect on Alps, which only transition between night and day. In the third example, we use our latent walk to turn red sports cars to blue. When we apply this vector to firetrucks, the only effect is to slightly change the color of the sky. Again, perceptual distance over multiple image samples per class confirms these qualitative observations: a 2-sample t-test yields t = 20.77, p < 0.001 for jellyfish/goldfinch, t = 8.14, p < 0.001 for volcano/alp, and t = 6.84, p < 0.001 for sports car/fire engine. We hypothesize that the differential impact of the same transformation on various image classes relates to the variability in the underlying dataset. Firetrucks only appear in red, but sports cars appear in a variety of colors. Therefore, our color transformation is constrained by the biases exhibited in individual classes in the dataset.
Figure 4. Each pair of rows shows how a single latent direction w affects two different ImageNet classes. We observe that the same direction affects different classes differently, and in a way consistent with semantic priors (e.g., "volcanoes" explode, "Alps" do not). Boxplots show the LPIPS perceptual distance before and after transformation for 200 samples per class. [Images and boxplots omitted; pairs: jellyfish/goldfinch (redness), volcano/alp (darkness), sports car/fire engine (blueness).]
Motivated by our qualitative observations in Fig. 1, Fig. 3, and Fig. 4, we next quantitatively investigate the degree to which we are able to achieve certain transformations.
Limits of Transformations. In quantifying the extent of our transformations, we are bounded by two opposing factors: how much we can transform the distribution, while also remaining within the manifold of realistic looking images.
For the color (luminance) transformation, we observe that by changing α we shift the proportion of light and dark colored pixels, and can do so without large increases in FID (Fig. 5, first column). However, the extent to which we can move the distribution is limited, and each transformed distribution still retains some probability mass over the full range of pixel intensities.
Figure 5. Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images, to limit our transformations to the natural image manifold. [Plots omitted; panels: Luminance, Shift X, Shift Y, Zoom, showing per-α distributions, intersection, and FID.]
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
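The intersection-area statistic used here can be sketched as the area under the pointwise minimum of two normalized histograms of an attribute (bin count and attribute range below are illustrative assumptions):

```python
import numpy as np

def hist_intersection(a, b, bins=50, lim=(0.0, 1.0)):
    # Area under the pointwise minimum of two normalized histograms of
    # an attribute (e.g. bounding-box center X); 1.0 = identical, 0.0 = disjoint.
    pa, _ = np.histogram(a, bins=bins, range=lim, density=True)
    pb, _ = np.histogram(b, bins=bins, range=lim, density=True)
    width = (lim[1] - lim[0]) / bins
    return float(np.minimum(pa, pb).sum() * width)
```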
We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is anti-symmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic looking images differs from that of zooming out.

These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2. How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(-α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
Figure 6. Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop). [Plots omitted; scatterplots of dataset σ vs. model ∆µ: Luminance r = 0.59, Shift X r = 0.28, Shift Y r = 0.39, Zoom r = 0.37, all p < 0.001.]
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as
∆µk = ‖µk,model(+α*) − µk,model(−α*)‖2 (3)
for a given class k, under the optimal walk w* for each transformation under Eq. 5, and we set α* to the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means of µk,model(+α*) and µk,model(−α*) under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
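Eq. 3 amounts to a distance between per-class means of a measured attribute; a minimal sketch (array shapes are assumptions):

```python
import numpy as np

def delta_mu(stats_pos, stats_neg):
    """stats_pos, stats_neg: (n_samples, d) arrays of a detected attribute
    (e.g. bounding-box center) for one class under the +alpha* and -alpha*
    transformations. Returns the L2 distance between the two means."""
    return float(np.linalg.norm(stats_pos.mean(axis=0) - stats_neg.mean(axis=0)))
```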
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1. Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
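The per-class trend can be sketched as a Pearson correlation between dataset spread and achieved model shift, one point per class (a from-scratch version of the standard statistic):

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation between dataset sigmas (x) and model delta-mu (y)
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```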
For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
4.3 Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space and obtained similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model Δμ for linear and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance   Shift X   Shift Y   Zoom
Linear         0.59        0.28      0.39     0.37
Non-linear     0.49        0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model Δμ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-T. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. http://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix

A. Method details

A.1 Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
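As a toy sketch of this optimization, the following uses a linear map standing in for the generator and a brightness offset standing in for the image-space edit (the actual implementation optimizes over BigGAN outputs in TensorFlow); the walk vector w is fit by stochastic gradient descent on the L2 objective J(w, α) = E_z ||G(z + αw) − edit(G(z), α)||²:

```python
import numpy as np

rng = np.random.default_rng(0)
dz, dx = 16, 8                 # latent size and toy "image" size

A = rng.normal(size=(dx, dz))  # stand-in linear generator: G(z) = A @ z + b
b = rng.normal(size=dx)

def G(z):
    return A @ z + b

def edit(x, alpha):
    """Target edit: a brightness shift that adds alpha to every 'pixel'."""
    return x + alpha

w = np.zeros(dz)               # walk vector to learn
lr = 0.01
for _ in range(5000):
    z = rng.normal(size=dz)
    alpha = rng.uniform(-1.0, 1.0)
    resid = G(z + alpha * w) - edit(G(z), alpha)  # equals alpha * (A @ w - 1)
    grad = 2.0 * alpha * A.T @ resid              # exact gradient of the L2 loss in w
    w -= lr * grad

z_test = rng.normal(size=dz)
# after training, walking the latent code reproduces the brightness edit:
# G(z_test + alpha * w) is close to edit(G(z_test), alpha)
```

With a real GAN, the same loop runs with autodiff over the generator; the linear stand-in simply makes the objective's fixed point (here, A·w = 1) easy to verify.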
A.2 Implementation details for the linear walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height, D. When training the walk, we enforce that the αg parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
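A minimal sketch of the shift edit and its visibility mask, assuming NumPy arrays in height × width × channel layout (the function name and toy image are illustrative, not from the released code):

```python
import numpy as np

def shift_x_edit(img, t):
    """Shift an image t pixels to the right (t < 0 shifts left); return the
    shifted image and a mask marking pixels still visible after the shift."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    mask = np.zeros((h, w), dtype=bool)
    if t >= 0:
        out[:, t:] = img[:, :w - t]
        mask[:, t:] = True
    else:
        out[:, :w + t] = img[:, -t:]
        mask[:, :w + t] = True
    return out, mask

img = np.arange(4 * 6 * 3, dtype=float).reshape(4, 6, 3)
shifted, mask = shift_x_edit(img, 2)
# the latent step is the pixel shift normalized by D (taken as the image width
# here), keeping alpha_g within [-1, 1]
alpha_g = 2 / img.shape[1]
```

During training, the L2 loss would be computed only where `mask` is True, leaving the generator free to hallucinate the newly exposed region.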
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
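A minimal sketch of the zoom edit, restricted to power-of-two factors with nearest-neighbor resampling for simplicity (a real implementation would use a proper image-resize routine and arbitrary factors):

```python
import numpy as np

def zoom_edit(img, alpha_t):
    """Zoom in (alpha_t < 1: crop the center, upsample) or out (alpha_t > 1:
    downsample, pad with zeros). Integer factors only, for brevity."""
    h, w = img.shape[:2]
    if alpha_t < 1:                      # zoom in by a factor of 1 / alpha_t
        f = int(round(1 / alpha_t))
        ch, cw = h // f, w // f
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        crop = img[y0:y0 + ch, x0:x0 + cw]
        return np.repeat(np.repeat(crop, f, axis=0), f, axis=1)
    else:                                # zoom out by a factor of alpha_t
        f = int(round(alpha_t))
        small = img[::f, ::f]
        out = np.zeros_like(img)         # zero region is masked for inpainting
        y0, x0 = (h - small.shape[0]) // 2, (w - small.shape[1]) // 2
        out[y0:y0 + small.shape[0], x0:x0 + small.shape[1]] = small
        return out

img = np.arange(8 * 8, dtype=float).reshape(8, 8)
zoom_in = zoom_edit(img, 0.5)    # 2x zoom in
zoom_out = zoom_edit(img, 2.0)   # 2x zoom out
# symmetric latent parameterization: alpha_g = log(alpha_t)
alpha_g = np.log(0.5)
```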
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values between [−0.5, 0.5] in training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
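A minimal sketch of this parameterization, with random vectors standing in for the learned directions wR, wG, and wB:

```python
import numpy as np

rng = np.random.default_rng(2)
dz = 128
z = rng.normal(size=dz)
w_R, w_G, w_B = (rng.normal(size=dz) for _ in range(3))  # stand-ins for learned directions

def color_walk(z, alpha_t):
    """Move z along the jointly learned color directions for alpha_t = (aR, aG, aB)."""
    aR, aG, aB = alpha_t
    return z + aR * w_R + aG * w_G + aB * w_B

def color_edit(img, alpha_t):
    """Target edit: add each channel offset to the corresponding image channel."""
    return img + np.asarray(alpha_t)  # broadcasts over an HWC image

z_prime = color_walk(z, (0.25, -0.1, 0.4))
edited = color_edit(np.zeros((2, 2, 3)), (0.1, 0.2, 0.3))
```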
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotations. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotations; we scale αg = αt/R, where R = 45, and apply a mask during training.
A.3 Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network, w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof: For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α)
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z1 = z + αF(z)
z2 = z1 + αF(z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z1 + αF(z1)    (4)
Therefore,
z + 2αF(z) = z + αF(z) + αF(z1)
⇒ αF(z) = αF(z1)
⇒ F(z) = F(z1)
Thus, F(·) simply becomes a linear trajectory that is independent of the input z.
A.4 Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
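A minimal sketch of the fixed-step iteration, with an untrained two-layer ReLU network standing in for the learned NN (the renormalization keeps the latent magnitude at its initial value, following the description above):

```python
import numpy as np

rng = np.random.default_rng(3)
dz, hidden = 128, 256
W1 = rng.normal(scale=0.1, size=(hidden, dz))
W2 = rng.normal(scale=0.1, size=(dz, hidden))

def NN(z):
    """Two-layer ReLU network computing the fixed epsilon-sized step (untrained here)."""
    return W2 @ np.maximum(W1 @ z, 0.0)

def nonlinear_walk(z, n_steps):
    """Euler-style iteration z_{i+1} = z_i + NN(z_i), renormalizing |z| each step."""
    target_norm = np.linalg.norm(z)
    for _ in range(n_steps):
        z = z + NN(z)
        z = z * (target_norm / np.linalg.norm(z))
    return z

z0 = rng.normal(size=dz)
z4 = nonlinear_walk(z0, 4)  # 4-5 discrete steps cover the linear walk's range
```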
B. Additional Experiments

B.1 Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2 Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance between images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
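A minimal sketch of this measurement, with a plain mean-squared distance standing in for LPIPS and a toy saturating brightness "transformation" standing in for the generator output under the learned walk (the real analysis uses the LPIPS network of [25]):

```python
import numpy as np

def perceptual_distance(x, y):
    """Stand-in for LPIPS: mean squared difference between two images."""
    return float(np.mean((x - y) ** 2))

rng = np.random.default_rng(4)
base = rng.uniform(size=(32, 32, 3))

def transform(img, alpha):
    """Toy transformation that saturates for large |alpha|, mimicking the plateau."""
    return np.clip(img + alpha, 0.0, 1.0)

alphas = np.linspace(-1.0, 1.0, 9)
pairs = [perceptual_distance(transform(base, a0), transform(base, a1))
         for a0, a1 in zip(alphas[:-1], alphas[1:])]
# distances between consecutive alphas shrink at the extremes: once pixels
# saturate, further steps cease to change the image
```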
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
B.3 Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model.² In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4 Alternative Walks in BigGAN

B.4.1 LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z[L(G(z + αw), edit(G(z), α))]    (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
B.4.2 Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5 Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6 Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12: Bounding boxes for randomly selected classes, using model-generated images under the zoom, horizontal shift, and vertical shift transformations, for random values of α.
Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation (Luminance r = 0.49, Shift X r = 0.49, Shift Y r = 0.55, Zoom r = 0.60).
Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.
Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.
Figure 21: Qualitative examples for learned transformations using the StyleGAN cat generator.
Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.
Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
Figure 5: Quantifying the extent of transformations. We compare the attributes of generated images under the raw model output G(z) to the distribution under a learned transformation model(α). We measure the intersection between G(z) and model(α), and also compute the FID on the transformed images to limit our transformations to the natural image manifold.
Using the shift vector, we show that we are able to move the distribution of the center object by changing α. In the underlying model, the center coordinate of the object is most concentrated at half of the image width and height, but after applying the shift in X and shift in Y transformations, the mode of the transformed distribution varies between 0.3 and 0.7 of the image width/height. To quantify the extent to which the distribution changes, we compute the area of intersection between the original model distribution and the distribution after applying the transformation, and we observe that the intersection decreases as we increase or decrease the magnitude of α. However, our transformations are limited to a certain extent – if we increase α beyond 150 pixels for vertical shifts, we start to generate unrealistic images, as evidenced by a sharp rise in FID and converging modes in the transformed distributions (Fig. 5, columns 2 & 3).
We perform a similar procedure for the zoom operation, by measuring the area of the bounding box for the detected object under different magnitudes of α. Like shift, we observe that subsequent increases in α magnitude start to have smaller and smaller effects on the mode of the resulting distribution (Fig. 5, last column). Past an 8x zoom in or out, we observe an increase in the FID of the transformed output, signifying decreasing image quality. Interestingly, for zoom, the FID under zooming in and zooming out is antisymmetric, indicating that the success to which we are able to achieve the zoom-in operation under the constraint of realistic-looking images differs from that of zooming out.
These trends are consistent with the plateau in transformation behavior that we qualitatively observe in Fig. 3. Although we can arbitrarily increase the α step size, after some point we are unable to achieve further transformation, and risk deviating from the natural image manifold.
4.2. How does the data affect the transformations?
How do the statistics of the data distribution affect the ability to achieve these transformations? In other words, is the extent to which we can transform each class, as we observed in Fig. 4, due to limited variability in the underlying dataset for each class?
One intuitive way of quantifying the amount of transformation achieved, dependent on the variability in the dataset, is to measure the difference in transformed model means, model(+α) and model(−α), and compare it to the spread of the dataset distribution. For each class, we compute the standard deviation of the dataset with respect to our statistic of interest (pixel RGB value for color, and bounding box area and center value for zoom and shift transformations, respectively). We hypothesize that if the amount of
[Figure 6 scatterplots of dataset σ (x-axis) vs. model ∆µ (y-axis): Zoom (r = 0.37), Shift Y (r = 0.39), Shift X (r = 0.28), Luminance (r = 0.59); p < 0.001 for all. Zoom strips shown for the robin and laptop classes; plot data omitted.]
Figure 6 Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g. robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g. laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.
More concretely, we define the change in model means under a given transformation as

∆µ_k = ∥µ_{k,model}(+α*) − µ_{k,model}(−α*)∥₂   (3)
for a given class k under the optimal walk w* for each transformation (Eq. 5), and we set α* to be the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes – one that is large enough to separate the means µ_{k,model}(+α*) and µ_{k,model}(−α*) under transformation, but also for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments we use T = 22).
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model ∆µ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of ∆µ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
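This per-class analysis can be sketched numerically: compute ∆µ for each class, then correlate it with the per-class dataset spread σ. The helper names and synthetic stand-in data below are ours, purely to make the computation concrete:

```python
import numpy as np

def delta_mu(pos_samples, neg_samples):
    """Eq. 3 for a scalar attribute: distance between the class means
    under the +alpha* and -alpha* walks."""
    return float(abs(np.mean(pos_samples) - np.mean(neg_samples)))

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Synthetic per-class data: classes with a broader dataset spread
# sigma_k permit a larger achieved mean shift delta_mu_k, plus noise.
rng = np.random.default_rng(1)
sigmas = rng.uniform(0.05, 0.3, 100)               # dataset spread per class
deltas = 1.2 * sigmas + rng.normal(0, 0.02, 100)   # achieved shift per class
r = pearson_r(sigmas, deltas)
```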
For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation ∆µ that we are able to achieve by applying +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

[Figure 7 panels: zoom from 0.125x to 8x under Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS walks; images omitted.]

Figure 7 Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
4.3. Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficients between dataset σ and model ∆µ for linear and nonlinear walks are shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance  Shift X  Shift Y  Zoom
Linear         0.59      0.28     0.39    0.37
Non-linear     0.49      0.49     0.55    0.60

Table 1 Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; p-value for slope < 0.001 for all transformations.
To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged: in other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

[Figure 8 panels: luminance PDFs of model(α) for α from −1.0 to 1.0; plot data omitted.]

Figure 8 Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation – we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix
A. Method details

A.1. Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
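To make the training procedure concrete, here is a toy sketch of the same objective with a linear stand-in generator: stochastic gradient descent learns a single vector w so that stepping in latent space reproduces an image-space edit. Everything here (the linear G, the additive edit, the hyperparameters) is our own illustrative choice, not the paper's BigGAN setup:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_x = 8, 16
A = rng.normal(size=(dim_x, dim_z))   # toy linear "generator": G(z) = A @ z
G = lambda z: A @ z
d = A @ rng.normal(size=dim_z)        # an edit direction G can realize
edit = lambda x, a: x + a * d         # target edit in image space

# Minimize E_z ||G(z + a*w) - edit(G(z), a)||^2 over the walk vector w.
w = np.zeros(dim_z)
lr = 0.01
for _ in range(5000):
    z = rng.normal(size=dim_z)
    a = rng.uniform(-1.0, 1.0)
    residual = G(z + a * w) - edit(G(z), a)  # equals a * (A @ w - d)
    w -= lr * 2 * a * (A.T @ residual)       # gradient of the L2 loss
```

After training, G(z + αw) ≈ G(z) + αd for any z, i.e. the single shared vector w implements the edit for all samples.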
A.2. Implementation details for the linear walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g w) and the target edit edit(G(z), α_t). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
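A minimal sketch of the shift-X edit together with its visibility mask, assuming numpy images; the function name and layout are our own, as the text does not specify the implementation:

```python
import numpy as np

def shift_x_edit(img, t):
    """Shift an HxW image t pixels right (t < 0: left).
    Returns the shifted image and a boolean mask marking pixels that
    remain visible from the source; the loss is applied only there,
    so the generator must inpaint the rest."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    mask = np.zeros((h, w), dtype=bool)
    if t >= 0:
        out[:, t:] = img[:, : w - t]
        mask[:, t:] = True
    else:
        out[:, : w + t] = img[:, -t:]
        mask[:, : w + t] = True
    return out, mask
```

The corresponding latent step uses α_g = α_t/D, so a full-width shift maps to α_g = 1.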
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
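The zoom-in edit (crop the central α_t fraction, resize back) can be sketched with a dependency-free nearest-neighbor resize; the helper name and resampling method are our own choices:

```python
import numpy as np

def zoom_in_edit(img, alpha_t):
    """Target for zooming in: crop the central alpha_t fraction of an
    HxW(xC) image and resize it back to HxW (0 < alpha_t <= 1).
    Nearest-neighbor resampling keeps the sketch dependency-free."""
    h, w = img.shape[:2]
    ch = max(1, int(round(h * alpha_t)))
    cw = max(1, int(round(w * alpha_t)))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h   # nearest-neighbor source rows
    cols = np.arange(w) * cw // w
    return crop[np.ix_(rows, cols)]
```

Zooming out instead downsamples by α_t and relies on the mask so the generator inpaints the missing surroundings; the walk step uses α_g = log(α_t), making zoom in and zoom out symmetric around zero.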
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g w = z + α_R w_R + α_G w_G + α_B w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.
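The color walk and its target edit can be sketched directly; the direction vectors here are random placeholders standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z = 8
# Placeholders for the three jointly learned walk directions.
wR, wG, wB = rng.normal(size=(3, dim_z))

def color_step(z, alpha_t):
    """Latent step z + aR*wR + aG*wG + aB*wB for alpha_t = (aR, aG, aB)."""
    aR, aG, aB = alpha_t
    return z + aR * wR + aG * wG + aB * wB

def color_edit(img, alpha_t):
    """Target edit: add each alpha to its channel of an HxWx3 image
    with values in [0, 1]."""
    return np.clip(img + np.asarray(alpha_t), 0.0, 1.0)
```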
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degrees of rotation. Using R = 45, we scale α_g = α_t/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degrees of rotation, we scale α_g = α_t/R where R = 45, and apply a mask during training.
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network, w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α).
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z₁ = z + αF(z)
z₂ = z₁ + αF(z₁)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z₁ + αF(z₁)   (4)
Therefore
z + 2αF(z) = z + αF(z) + αF(z₁) ⇒ αF(z) = αF(z₁) ⇒ F(z) = F(z₁).
Thus F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε dz/dt |_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
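The ε-step walk can be sketched as repeated application of z ← z + NN(z) with renormalization. The network weights below are random stand-ins for trained ones, and only the positive direction is shown (the negative direction would use a second, independently weighted network):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z = 8
# Random stand-in for the trained two-layer network NN(z).
W1 = rng.normal(size=(16, dim_z))
W2 = 0.1 * rng.normal(size=(dim_z, 16))
NN = lambda z: W2 @ np.tanh(W1 @ z)

def nonlinear_walk(z, n_steps):
    """Apply the fixed eps-step transform n_steps times, renormalizing
    ||z|| after each step (Euler-style integration with a fixed step)."""
    norm = np.linalg.norm(z)
    for _ in range(n_steps):
        z = z + NN(z)                        # one eps-sized step
        z = z * (norm / np.linalg.norm(z))   # keep the latent magnitude
    return z
```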
B. Additional Experiments

B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations, for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4. Alternative Walks in BigGAN
B.4.1 LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function

J(w, α) = E_z[L(G(z + αw), edit(G(z), α))],   (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, with learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
B.4.2 Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount to which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3 Additional Qualitative Examples
We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5. Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, 23, and corresponding quantitative analyses in Figs. 20, 22, 24.
B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
[Figure 9 panels (Luminance, Shift X, Shift Y, Zoom): attribute PDFs of data vs. model for pixel intensity, center X, center Y, and bounding box area; plot data omitted.]
Figure 9 Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
[Figure 10 panels: perceptual distance vs. α for Luminance, Rotate 2D, Shift X, Shift Y, Zoom (log α), and Rotate 3D; plot data omitted.]
Figure 10 LPIPS perceptual distances between images generated from pairs of consecutive α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN, and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11 Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12 Bounding boxes for randomly selected classes, using model-generated images under the zoom and the horizontal and vertical shift transformations, for random values of α.
Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15 panels (Luminance, Shift X, Shift Y, Zoom): intersection between G(z) and model(α) vs. α, attribute PDFs of model(α) under varying α, FID vs. α, and dataset σ vs. model ∆µ scatterplots with per-panel correlations 0.49, 0.49, 0.55, 0.60; plot data omitted.]
Figure 15 Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
[Figure 20 panels (Luminance, Shift X, Shift Y, Zoom): data vs. model attribute PDFs, intersection vs. α, FID vs. α, and transformed-model PDFs under varying α; plot data omitted.]
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
[Figure 22 panels (Luminance, Shift X, Shift Y, Zoom): data vs. model attribute PDFs, intersection vs. α, FID vs. α, and transformed-model PDFs under varying α; plot data omitted.]
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
[Figure 23 sliders: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.]

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
[Figure 24 panels: attribute PDFs, FID vs. α, and intersection vs. α for the Luminance, Shift X, Shift Y, and Zoom transformations.]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values at which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
[Figure 25 sliders: Color, Contrast.]

Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
[Figure 6 panels: per-class area distributions P(A); scatterplots of dataset spread vs. model µ shift with correlations Zoom r = 0.37, Shift X r = 0.28, Shift Y r = 0.39, Luminance r = 0.59 (all p < 0.001); zoom examples for the robin and laptop classes.]
Figure 6: Understanding per-class biases. We observe a correlation between the variability in the training data for ImageNet classes and our ability to shift the distribution under latent space transformations. Classes with low variability (e.g., robin) limit our ability to achieve desired transformations, in comparison to classes with a broad dataset distribution (e.g., laptop).
transformation is biased depending on the image class, we will observe a correlation between the distance of the mean shifts and the standard deviation of the data distribution.

More concretely, we define the change in model means under a given transformation as
$$\Delta\mu_k = \left\lVert \mu_{k,\mathrm{model}}(+\alpha^{\ast}) - \mu_{k,\mathrm{model}}(-\alpha^{\ast}) \right\rVert_2 \tag{3}$$
for a given class k, under the optimal walk w* for each transformation (Eq. 5), where we set α* to the largest and smallest α values used in training. We note that the degree to which we achieve each transformation is a function of α, so we use the same value of α across all classes: one that is large enough to separate the means µ_k,model(α*) and µ_k,model(−α*) under transformation, but for which the FID of the generated distribution remains below a threshold T of generating reasonably realistic images (for our experiments, we use T = 22).
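As a concrete sketch, Eq. 3 reduces to an L2 distance between per-direction attribute means. The array names and example attribute values below are illustrative, not from the paper's released code.

```python
import numpy as np

def delta_mu(attrs_pos, attrs_neg):
    """Eq. 3: L2 distance between the mean attribute value of samples
    generated under the +alpha* walk and under the -alpha* walk."""
    mu_pos = np.mean(np.asarray(attrs_pos, dtype=float), axis=0)
    mu_neg = np.mean(np.asarray(attrs_neg, dtype=float), axis=0)
    return float(np.linalg.norm(mu_pos - mu_neg))

# Illustrative: zoom attribute (relative bounding-box area) for one class,
# measured on samples transformed in each direction.
attrs_pos = [0.30, 0.34, 0.32, 0.36]   # samples under +alpha*
attrs_neg = [0.18, 0.22, 0.20, 0.24]   # samples under -alpha*
```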
We apply this measure in Fig. 6, where on the x-axis we plot the standard deviation σ of the dataset, and on the y-axis we show the model Δµ under a +α* and −α* transformation, as defined in Eq. 3. We sample randomly from 100 classes for the color, zoom, and shift transformations, and generate 200 samples of each class under the positive and negative transformations. We use the same setup of drawing samples from the model and dataset and computing the statistics for each transformation as described in Sec. 4.1.

Indeed, we find that the width of the dataset distribution, captured by the standard deviation of random samples drawn from the dataset for each class, is related to the extent to which we achieve each transformation. For these transformations, we observe a positive correlation between the spread of the dataset and the magnitude of Δµ observed in the transformed model distributions. Furthermore, the slope of all observed trends differs significantly from zero (p < 0.001 for all transformations).
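The reported correlations are plain Pearson coefficients over the per-class (dataset σ, model Δµ) pairs; a minimal NumPy sketch, without relying on a stats package:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between per-class dataset spread (sigma)
    and per-class model mean shift (delta mu)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

With one (σ, Δµ) pair per sampled class, this yields the r values shown in Fig. 6 (e.g., r = 0.37 for zoom).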
For the zoom transformation, we show examples of two extremes along the trend. We show an example of the "robin" class, in which the spread σ in the dataset is low, and subsequently the separation Δµ that we are able to achieve by applying the +α* and −α* transformations is limited. On the other hand, in the "laptop" class, the underlying spread in the dataset is broad: ImageNet contains images of laptops of various sizes, and we are able to attain a wider shift in the model distribution when taking an α*-sized step in the latent space.
From these results, we conclude that the extent to which we achieve each transformation is related to the variability that inherently exists in the dataset. Consistent with our qualitative observations in Fig. 4, we find that if the images for a particular class have adequate coverage over the entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.

[Figure 7 panels: zoom sequences from 0.125x to 8x for Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS walks.]

Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
4.3. Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space, and obtain similar quantitative results. To summarize, the Pearson's correlation coefficients between dataset σ and model Δµ for linear and nonlinear walks are shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance   Shift X   Shift Y   Zoom
Linear          0.59       0.28      0.39     0.37
Non-linear      0.49       0.49      0.55     0.60

Table 1: Pearson's correlation coefficient between dataset σ and model Δµ for measured attributes; the p-value for the slope is < 0.001 for all transformations.
To test the generality of our findings across model architectures, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes that the W space is less entangled than z, we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged: in other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).

[Figure 8 panels: PDFs of the luminance attribute at each α step, and Luminance, Red, Green, and Blue color sliders.]

Figure 8: Distribution for the luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model exhibits characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.

[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.

[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.

[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.

[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.

[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.

[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.

[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.

[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.

[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.

[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.

[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[21] Y. Shen, J. Gu, J.-T. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.

[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.

[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.

[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.

[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[26] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix

A. Method details

A.1. Optimization for the linear walk

We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
A.2. Implementation details for the linear walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + α_g w) and the target edit edit(G(z), α_t). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −α_t pixels to the left and α_t pixels to the right, where we set α_t to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the α_g parameter ranges between −1 and 1; thus, for a random shift by t pixels, we use the value α_g = α_t/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
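A minimal sketch of the shift-X target edit and its visibility mask, using the t and D notation above; the function names and the zero-fill of obscured pixels are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def edit_shift_x(img, t):
    """Shift an (H, W, C) image t pixels to the right (t < 0: left).
    Returns the shifted target and a mask of still-visible pixels;
    the loss is applied only where the mask is True, forcing the
    generator to extrapolate in the masked-out region."""
    out = np.roll(img, t, axis=1)
    mask = np.zeros(img.shape[:2], dtype=bool)
    if t >= 0:
        out[:, :t] = 0        # wrapped-around columns are obscured
        mask[:, t:] = True
    else:
        out[:, t:] = 0
        mask[:, :t] = True
    return out, mask

def alpha_g(t, D):
    """Map a pixel shift t to the walk coordinate alpha_g = alpha_t / D."""
    return t / D
```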
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some α_t amount, where 0.25 < α_t < 1, and resize it back to its original size. To zoom out, we downsample the image by α_t, where 1 < α_t < 4. To allow for both a positive and negative walk direction, we set α_g = log(α_t). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple α_t = (α_R, α_G, α_B), where each of α_R, α_G, α_B can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding α_t values to each of the image channels. Our latent space walk is parameterized as z + α_g w = z + α_R w_R + α_G w_G + α_B w_B, where we jointly learn the three walk directions w_R, w_G, and w_B.
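The color walk and its target edit can be sketched as follows; the latent dimensionality and the randomly initialized directions stand in for the jointly learned w_R, w_G, w_B and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128                                 # latent dimensionality (illustrative)
w_rgb = rng.normal(size=(3, dim)) * 0.01  # stands in for learned wR, wG, wB

def color_walk(z, alpha_t):
    """z + aR*wR + aG*wG + aB*wB for alpha_t = (aR, aG, aB)."""
    return z + np.asarray(alpha_t) @ w_rgb

def edit_color(img, alpha_t):
    """Target edit: add each channel's alpha offset to an (H, W, 3)
    image with values in [0, 1]."""
    return np.clip(img + np.asarray(alpha_t), 0.0, 1.0)
```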
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ α_t ≤ 45 degree rotations. Using R = 45, we scale α_g = α_t/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ α_t ≤ 45 degree rotations, scale α_g = α_t/R where R = 45, and apply a mask during training.
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, performing two equal edit operations is equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,

$$\mathrm{edit}(G(z), 2\alpha) = \mathrm{edit}(\mathrm{edit}(G(z), \alpha), \alpha).$$

To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:

$$z_1 = z + \alpha F(z)$$
$$z_2 = z_1 + \alpha F(z_1)$$

However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:

$$z + 2\alpha F(z) = z_1 + \alpha F(z_1) \tag{4}$$

Therefore:

$$z + 2\alpha F(z) = z + \alpha F(z) + \alpha F(z_1) \;\Rightarrow\; \alpha F(z) = \alpha F(z_1) \;\Rightarrow\; F(z) = F(z_1)$$

Thus F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk

Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε (dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).

We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
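The repeated ε-step walk can be sketched as below; `nn_step` stands in for the trained two-layer network, and the renormalization target is an assumption (e.g., the expected norm under the latent prior), since the text only states that the magnitude of z is renormalized.

```python
import numpy as np

def nonlinear_walk(z, nn_step, n_steps, target_norm):
    """Apply the fixed-eps transformation n_steps times:
    z <- z + NN(z), renormalizing the magnitude of z after each
    step (Euler-style integration of eps * dz/dt = NN(z))."""
    for _ in range(n_steps):
        z = z + nn_step(z)
        z = z * (target_norm / np.linalg.norm(z))
    return z
```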
B. Additional Experiments

B.1. Model and Data Distributions

How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits

We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of α_i and α_{i+1}. For the shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect, and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.

Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
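A sketch of this saturation measurement: any image distance can be plugged in for `dist` (the analysis above uses LPIPS; a plain L2 distance is used here only so the sketch stays self-contained).

```python
import numpy as np

def consecutive_distances(images_by_alpha, dist):
    """Distance between images generated at consecutive alpha values
    (alpha_i, alpha_{i+1}). A plateau near zero at large |alpha|
    indicates the transformation has stopped having an effect."""
    return [dist(a, b) for a, b in zip(images_by_alpha, images_by_alpha[1:])]

l2 = lambda a, b: float(np.linalg.norm(a - b))
```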
B.3. Detected Bounding Boxes

To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
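A sketch of how detected boxes translate into the measured attributes; the box format (pixel corner coordinates) is an assumption about the detector's output, not a documented interface.

```python
def box_attributes(box, img_w, img_h):
    """Convert a detected box (x0, y0, x1, y1) into the attributes
    tracked in the experiments: relative box area for zoom, and the
    relative box center for shift X / shift Y."""
    x0, y0, x1, y1 = box
    return {
        "zoom": (x1 - x0) * (y1 - y0) / (img_w * img_h),
        "shift_x": 0.5 * (x0 + x1) / img_w,
        "shift_y": 0.5 * (y0 + y1) / img_h,
    }
```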
B.4. Alternative Walks in BigGAN

B.4.1. LPIPS objective

In the main text, we learn the latent space walk w by minimizing the objective function

$$J(w, \alpha) = \mathbb{E}_z\!\left[\mathcal{L}\big(G(z + \alpha w),\, \mathrm{edit}(G(z), \alpha)\big)\right] \tag{5}$$

using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training, with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
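To make the optimization concrete, here is a toy instance of Eq. 5 with a linear "generator" and an additive image-space edit; everything about the toy (dimensions, learning rate, the linear G, the form of the edit) is illustrative and much simpler than the actual BigGAN setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, dx = 8, 16
A = rng.normal(size=(dx, dz))    # toy linear "generator": G(z) = A z
d = rng.normal(size=dx)          # image-space direction of the target edit

G = lambda z: A @ z
edit = lambda x, a: x + a * d    # edit(G(z), alpha)

# SGD on J(w, alpha) = E_z ||G(z + alpha*w) - edit(G(z), alpha)||^2
w = np.zeros(dz)
for _ in range(1000):
    z = rng.normal(size=dz)
    a = rng.uniform(-1.0, 1.0)
    resid = G(z + a * w) - edit(G(z), a)  # equals a * (A w - d)
    w -= 0.005 * 2 * a * (A.T @ resid)    # gradient step on ||resid||^2
```

After training, A w approximates the least-squares projection of the edit direction d, i.e., the single walk vector w that best reproduces the edit across all z and α.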
B.4.2. Non-linear Walks

Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text: we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3. Additional Qualitative Examples

We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear-LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5. Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6. Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target: for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining the zoom, shift X, and shift Y transformations (Fig. 26).
[Figure 9 panels: PDFs of Pixel Intensity (Luminance), bounding-box Center X (Shift X), Center Y (Shift Y), and Area (Zoom) for dataset and model samples.]

Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
[Figure 10 panels: perceptual distance vs. α for the Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, and Zoom transformations.]

Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN, and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12: Bounding boxes for randomly selected classes, using model-generated images under the zoom and horizontal and vertical shift transformations for random values of α.
Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15 panels: intersection vs. α, dataset-vs.-model µ scatterplots (ρ = 0.49, 0.49, 0.55, 0.60), attribute PDFs, and FID vs. α for the Luminance, Shift X, Shift Y, and Zoom nonlinear walks.]

Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
[Figure 16 sliders: Zoom, Color, Shift X, Shift Y, Rotate 2D, Rotate 3D.]

Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
[Figure 17 sliders: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.]

Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
[Figure 18 sliders: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.]

Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
[Figure 19 sliders: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.]

Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator.
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
[Figure 7 panels: Linear L2, Linear LPIPS, Non-linear L2, and Non-linear LPIPS zoom walks at scales 0.125x to 8x.]
Figure 7: Comparison of linear and nonlinear walks for the zoom operation. The linear walk undershoots the targeted level of transformation, but maintains more realistic output.
entire range of a given transformation, then we are better able to move the model distribution to both extremes. On the other hand, if the images for the given class exhibit less variability, the extent of our transformation is limited by this property of the dataset.
4.3. Alternative architectures and walks
We ran an identical set of experiments using the nonlinear walk in the BigGAN latent space and obtain similar quantitative results. To summarize, the Pearson's correlation coefficient between dataset σ and model ∆µ for linear walks and nonlinear walks is shown in Table 1, with full results in B.4.2. Qualitatively, we observe that while the linear trajectory undershoots the targeted level of transformation, it is able to preserve more realistic-looking results (Fig. 7). The transformations involve a trade-off between minimizing the loss and maintaining realistic output, and we hypothesize that the linear walk functions as an implicit regularizer that corresponds well with the inherent organization of the latent space.
             Luminance  Shift X  Shift Y  Zoom
Linear         0.59      0.28     0.39    0.37
Non-linear     0.49      0.49     0.55    0.60

Table 1: Pearson's correlation coefficient between dataset σ and model ∆µ for measured attributes; p-value for slope < 0.001 for all transformations.
To test the generality of our findings across model architecture, we ran similar experiments on StyleGAN, in which the latent space is divided into two spaces, z and W. As [13] notes, the W space is less entangled than z, so we apply the linear walk to W and show results in Fig. 8 and B.5. One interesting aspect of StyleGAN is that we can change color while leaving other structure in the image unchanged. In other words, while green faces do not naturally exist in the dataset, the StyleGAN model is still able to generate them. This differs from the behavior of BigGAN, where changing
[Figure 8 plots: model PDFs of luminance for α from -1.0 to 1.0, with qualitative strips for - Luminance +, - Red +, - Green +, and - Blue +.]
Figure 8: Distribution for luminance transformation learned from the StyleGAN cars generator, and qualitative examples of color transformations on various datasets using StyleGAN.
color results in different semantics in the image, e.g., turning a dormant volcano into an active one, or a daytime scene into a nighttime one. StyleGAN, however, does not preserve the exact geometry of objects under other transformations, e.g., zoom and shift (see Sec. B.5).
5. Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H. Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I., and a U.S. National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure. 2
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018. 2
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018. 3
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. 1, 3, 4
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019. 3
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019. 3
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891-906, 1991. 2
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018. 2
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019. 3
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014. 1, 3
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44-51. Springer, 2011. 3
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015. 2
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018. 1, 2, 3, 4, 8, 12
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755-1758, 2009. 24
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 10
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236-10245, 2018. 2
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991-999, 2015. 3
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016. 4
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719-727, 2012. 2
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 2
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019. 3, 12
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22. 3
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011. 2
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064-7073, 2017. 2
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 3, 11
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016. 3
Appendix
A. Method details

A.1. Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
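To make the setup concrete, here is a minimal sketch of this objective, with a toy linear generator standing in for BigGAN, a simple additive edit standing in for the image-space edit, and plain SGD in place of Adam (all names, dimensions, and hyperparameters are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_x = 16, 8
G_mat = rng.normal(size=(dim_x, dim_z))  # toy linear "generator"

def G(z):
    return G_mat @ z                     # G(z): latent -> "image"

def edit(x, alpha):
    return x + alpha                     # toy edit: add alpha to every pixel

w = np.zeros(dim_z)                      # walk vector to be learned
lr = 0.01
for _ in range(5000):
    z = rng.normal(size=dim_z)
    alpha = rng.uniform(-1.0, 1.0)
    # L2 loss between the steered output G(z + alpha*w)
    # and the edited target edit(G(z), alpha)
    residual = G(z + alpha * w) - edit(G(z), alpha)
    w -= lr * alpha * (G_mat.T @ residual)  # gradient of 0.5*||residual||^2

# after training, stepping by alpha*w reproduces the edit:
# G(z + alpha*w) - G(z) is approximately alpha * (vector of ones)
```

In this linear toy the walk can match the edit exactly; for a real GAN the same objective is only minimized approximately, which is the gap the paper measures.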
A.2. Implementation Details for Linear Walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted −αt pixels to the left and αt pixels to the right, where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between -1 and 1; thus, for a random shift by t pixels, we use the value αg = αt/D. We apply a mask to the shifted image, so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
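A minimal sketch of how such a shifted target and its visibility mask might be built (our own illustrative version; the function name and zero-filling of the disoccluded strip are assumptions):

```python
import numpy as np

def shift_x(img, t):
    """Shift an image t pixels to the right (t < 0 shifts left), zero-filling
    the disoccluded strip, and return a mask of pixels copied from the source."""
    out = np.zeros_like(img)
    mask = np.zeros(img.shape, dtype=bool)
    w = img.shape[1]
    if t >= 0:
        out[:, t:] = img[:, :w - t]
        mask[:, t:] = True
    else:
        out[:, :t] = img[:, -t:]
        mask[:, :t] = True
    return out, mask

img = np.arange(12.0).reshape(3, 4)
target, mask = shift_x(img, 1)
# the L2 loss is applied only where mask is True, so the generator must
# inpaint (extrapolate) the strip that was shifted out of view
```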
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
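A sketch of the corresponding zoom edit (nearest-neighbor resizing and the exact crop arithmetic are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbor resize (stand-in for a proper interpolating resize)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[np.ix_(rows, cols)]

def zoom_target(img, alpha_t):
    """alpha_t < 1: crop the central alpha_t fraction and resize back up (zoom in).
    alpha_t > 1: downsample by alpha_t into the center, leaving a border
    that the generator must inpaint (zoom out)."""
    h, w = img.shape[:2]
    if alpha_t < 1:
        ch, cw = int(h * alpha_t), int(w * alpha_t)
        top, left = (h - ch) // 2, (w - cw) // 2
        return resize_nn(img[top:top + ch, left:left + cw], h, w)
    sh, sw = int(h / alpha_t), int(w / alpha_t)
    small = resize_nn(img, sh, sw)
    out = np.zeros_like(img)
    top, left = (h - sh) // 2, (w - sw) // 2
    out[top:top + sh, left:left + sw] = small
    return out

# symmetric latent parametrization: a zoom by alpha_t corresponds to a step
# alpha_g = log(alpha_t), so 2x in and 2x out are equal-and-opposite steps
alpha_g = np.log(2.0)
```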
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values between [−0.5, 0.5] in training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
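This three-direction parametrization can be sketched as follows (the dimension and the untrained random directions are placeholders; in practice wR, wG, wB are learned jointly):

```python
import numpy as np

rng = np.random.default_rng(1)
dim_z = 128
# three color walk directions; random placeholders here, learned in practice
wR, wG, wB = rng.normal(size=(3, dim_z))

def color_step(z, alpha_t):
    """Move z along the color directions; alpha_t = (aR, aG, aB),
    each in [-0.5, 0.5] during training."""
    aR, aG, aB = alpha_t
    return z + aR * wR + aG * wG + aB * wB

z = rng.normal(size=dim_z)
z_redder = color_step(z, (0.5, 0.0, 0.0))   # push only the red channel
```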
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degree rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degree rotation, scale αg = αt/R where R = 45, and apply a mask during training.
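One simple way to realize such a billboard-style perspective warp is with a rotation homography about the vertical image axis (the pinhole model, focal length choice, and the pure-rotation form H = K R K⁻¹ are our own assumptions; the paper does not specify its exact warp):

```python
import numpy as np

def rot_y(theta):
    """3D rotation about the vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def billboard_homography(theta_deg, w, h, f=None):
    """3x3 homography that perspectively rotates an image plane by theta_deg
    about its vertical axis, assuming a pinhole camera with focal length f."""
    f = f if f is not None else float(max(w, h))
    K = np.array([[f, 0.0, w / 2.0], [0.0, f, h / 2.0], [0.0, 0.0, 1.0]])
    return K @ rot_y(np.radians(theta_deg)) @ np.linalg.inv(K)

H = billboard_homography(30.0, 128, 128)
# apply H to homogeneous pixel coordinates [x, y, 1] to produce the warped target
```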
A.3. Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, two equal edit operations are equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α)
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z1 = z + αF(z)
z2 = z1 + αF(z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z1 + αF(z1)    (4)

Therefore:

z + 2αF(z) = z + αF(z) + αF(z1)
⇒ αF(z) = αF(z1)
⇒ F(z) = F(z1)
Thus F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4-5 steps.
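A sketch of this fixed-step walk (untrained random MLP weights stand in for the learned network; only the positive direction is shown, and the renormalization convention is our reading of the text):

```python
import numpy as np

rng = np.random.default_rng(2)
dim_z, hidden = 16, 32
# two-layer MLP for the positive direction (the negative direction would use
# a second network with the same architecture but independent weights)
W1, b1 = 0.1 * rng.normal(size=(hidden, dim_z)), np.zeros(hidden)
W2, b2 = 0.1 * rng.normal(size=(dim_z, hidden)), np.zeros(dim_z)

def nn(z):
    return W2 @ np.maximum(0.0, W1 @ z + b1) + b2   # ReLU MLP

def F(z):
    """One fixed epsilon-step: z <- z + NN(z), renormalized so the latent
    code keeps its original magnitude."""
    z_next = z + nn(z)
    return z_next * (np.linalg.norm(z) / np.linalg.norm(z_next))

z = rng.normal(size=dim_z)
z_k = z
for _ in range(4):   # compose several epsilon-steps, Euler-style
    z_k = F(z_k)
```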
B. Additional Experiments

B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify this trend, we can compute the LPIPS perceptual distance between images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect, and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
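The consecutive-pair analysis can be sketched as follows (plain L2 distance stands in for LPIPS, and the toy "generated image" is a placeholder whose response saturates at large |α|, mimicking the observed effect):

```python
import numpy as np

alphas = np.linspace(-1.0, 1.0, 9)

def generate(alpha):
    """Placeholder for G(z + alpha*w): a deterministic toy 'image' whose
    change saturates at large |alpha|."""
    return np.tanh(2.0 * alpha) * np.ones((8, 8))

# distance between outputs at consecutive alpha_i, alpha_{i+1}
dists = [np.linalg.norm(generate(a2) - generate(a1))
         for a1, a2 in zip(alphas[:-1], alphas[1:])]
# distances are largest near alpha = 0 and shrink at the extremes,
# i.e. steps at large |alpha| change the image less
```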
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Fig. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B.4. Alternative Walks in BigGAN
B.4.1. LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z[L(G(z + αw), edit(G(z), α))]    (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to scaling of α).
B.4.2. Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text: we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
²https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3. Additional Qualitative Examples
We show qualitative examples for randomly generated categories for BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, 18, respectively.
B.5. Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, 23, and corresponding quantitative analyses in Figs. 20, 22, 24.
B.6. Qualitative examples for additional transformations
Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target: for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
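The segmented-target idea amounts to masking the loss to the object region; a minimal sketch (the mask source and function names are illustrative assumptions):

```python
import numpy as np

def masked_l2(output, target, mask):
    """Mean squared error restricted to a segmentation mask, so the color
    walk is only supervised inside the object (e.g. the car) region."""
    m = mask.astype(float)[..., None]        # broadcast over color channels
    diff = (output - target) * m
    return float((diff ** 2).sum() / max(m.sum() * output.shape[-1], 1.0))

out = np.zeros((4, 4, 3))
tgt = np.ones((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                        # pretend-car region
loss = masked_l2(out, tgt, mask)             # error counted only inside mask
```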
[Figure 9 panels: Luminance (pixel intensity), Shift X (center X), Shift Y (center Y), and Zoom (area), each overlaying data and model PDFs.]
Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
[Figure 10 panels: perceptual distance versus α for Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, and Zoom (log(α)).]
Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11: Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12: Bounding boxes for randomly selected classes, using model-generated images for zoom and horizontal and vertical shift transformations under random values of α.
Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15 panels: Luminance, Shift X, Shift Y, and Zoom (ρ = 0.49, 0.49, 0.55, 0.60, respectively), showing model(α) PDFs, intersection with G(z), FID versus α, and dataset versus model µ scatterplots.]
Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z), compared to the distribution under a learned transformation model(α); the intersection area between G(z) and model(α); the FID score on transformed images; and scatterplots relating dataset variability to the extent of model transformation.
Figure 16: Qualitative examples (color, zoom, shift X/Y, rotate 2D/3D) for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17: Qualitative examples (color, zoom, shift X/Y, rotate 2D/3D) for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18: Qualitative examples (color, zoom, shift X/Y, rotate 2D/3D) for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19: Qualitative examples of learned transformations (color, zoom, shift X/Y, rotate 2D/3D) using the StyleGAN car generator.
[Figure 20 panels: Luminance, Shift X, Shift Y, and Zoom, comparing data and model PDFs, with intersection and FID versus α.]
Figure 20: Quantitative experiments for learned transformations using the StyleGAN car generator.
Figure 21: Qualitative examples of learned transformations (color, zoom, shift X/Y, rotate 2D/3D) using the StyleGAN cat generator.
[Figure 22 panels: Luminance, Shift X, Shift Y, and Zoom, comparing data and model PDFs, with intersection and FID versus α.]
Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.
Figure 23: Qualitative examples of learned transformations (color, zoom, shift X/Y, rotate 2D/3D) using the StyleGAN FFHQ face generator.
[Figure 24 panels: Luminance, Shift X, Shift Y, and Zoom, comparing data and model PDFs, with intersection and FID versus α.]
Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
5 Conclusion
GANs are powerful generative models, but are they simply replicating the existing training datapoints, or are they able to generalize outside of the training distribution? We investigate this question by exploring walks in the latent space of GANs. We optimize trajectories in latent space to reflect simple image transformations in the generated output, learned in a self-supervised manner. We find that the model is able to exhibit characteristics of extrapolation: we are able to "steer" the generated output to simulate camera zoom, horizontal and vertical movement, camera rotations, and recolorization. However, our ability to shift the distribution is finite: we can transform images to some degree, but cannot extrapolate entirely outside the support of the training data. Because training datasets are biased in capturing the most basic properties of the natural visual world, transformations in the latent space of generative models can only do so much.
Acknowledgements

We would like to thank Quang H Le, Lore Goetschalckx, Alex Andonian, David Bau, and Jonas Wulff for helpful discussions. This work was supported by a Google Faculty Research Award to P.I. and a US National Science Foundation Graduate Research Fellowship to L.C.
References

[1] A. Amini, A. Soleimany, W. Schwarting, S. Bhatia, and D. Rus. Uncovering and mitigating algorithmic bias through learned latent structure.
[2] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[3] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv preprint arXiv:1902.04615, 2019.
[6] E. Denton, B. Hutchinson, M. Mitchell, and T. Gebru. Detecting bias with generative counterfactual face attribute augmentation. arXiv preprint arXiv:1906.06439, 2019.
[7] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):891–906, 1991.
[8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[9] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola. GANalyze: Toward visual definitions of cognitive image properties. arXiv preprint arXiv:1906.10112, 2019.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
[12] A. Jahanian, S. Vishwanathan, and J. P. Allebach. Learning visual balance from large-scale datasets of aesthetically highly rated images. In Human Vision and Electronic Imaging XX, volume 9394, page 93940Y. International Society for Optics and Photonics, 2015.
[13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[14] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
[17] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 991–999, 2015.
[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[19] E. Mezuman and Y. Weiss. Learning about canonical views from internet image collections. In Advances in Neural Information Processing Systems, pages 719–727, 2012.
[20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Y. Shen, J. Gu, J.-t. Huang, X. Tang, and B. Zhou. Interpreting the latent space of GANs for semantic face editing. arXiv preprint, 2019.
[22] J. Simon. Ganbreeder. https://ganbreeder.app, accessed 2019-03-22.
[23] A. Torralba and A. A. Efros. Unbiased look at dataset bias. 2011.
[24] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017.
[25] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[26] J.-Y. Zhu, P. Krahenbuhl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Appendix
A Method details

A1 Optimization for the linear walk
We learn the walk vector using mini-batch stochastic gradient descent with the Adam optimizer [15] in TensorFlow, trained on 20,000 unique samples from the latent space z. We share the vector w across all ImageNet categories for the BigGAN model.
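As a concrete illustration of this kind of optimization, the sketch below runs mini-batch descent on a toy linear "generator" with a hand-derived gradient; the matrix A, the constant-brightness edit, and all hyperparameters are illustrative stand-ins, not the paper's BigGAN/TensorFlow setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a linear "generator" G(z) = A z, and an edit whose
# effect in image space is a constant brightness offset, so the
# gradient of the L2 loss with respect to w has a closed form.
A = rng.standard_normal((8, 4))

def G(z):
    return A @ z

w = np.zeros(4)   # walk vector to be learned
lr = 0.01

for step in range(2000):
    z = rng.standard_normal(4)
    alpha = rng.uniform(-1, 1)
    target = G(z) + alpha * np.ones(8)      # toy edit(G(z), alpha)
    residual = G(z + alpha * w) - target
    # d/dw ||A(z + a w) - target||^2 = 2 a A^T residual
    w -= lr * 2 * alpha * (A.T @ residual)

# After training, G(z + alpha * w) approximates edit(G(z), alpha).
```

With a real generator the gradient is supplied by autodiff rather than derived by hand, but the loop structure (sample z and α, form the edited target, step w) is the same.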
A2 Implementation Details for Linear Walk
We experiment with a number of different transformations learned in the latent space, each corresponding to a different walk vector. Each of these transformations can be learned without any direct supervision, simply by applying our desired edit to the source image. Furthermore, the parameter α allows us to vary the extent of the transformation. We found that a slight modification to each transformation improved the degree to which we were able to steer the output space: we scale α differently for the learned transformation G(z + αg·w) and the target edit edit(G(z), αt). We detail each transformation below.
Shift. We learn transformations corresponding to shifting an image in the horizontal X direction and the vertical Y direction. We train on source images that are shifted by −αt pixels (to the left) and +αt pixels (to the right), where we set αt to be between zero and one-half of the source image width or height D. When training the walk, we enforce that the αg parameter ranges between -1 and 1; thus for a random shift by αt pixels, we use the value αg = αt/D. We apply a mask to the shifted image so that we only apply the loss function on the visible portion of the source image. This forces the generator to extrapolate on the obscured region of the target image.
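A minimal numpy version of such a shift edit with its visibility mask might look like this (`edit_shift_x` is an illustrative helper, not code from the paper):

```python
import numpy as np

def edit_shift_x(img, t):
    """Shift an H x W x C image horizontally by t pixels.

    Returns the shifted image and a mask marking pixels that came from
    the source; the loss is applied only where mask == 1, so the
    generator must extrapolate the obscured region.
    """
    h, w, c = img.shape
    out = np.zeros_like(img)
    mask = np.zeros((h, w, 1), dtype=img.dtype)
    if t >= 0:
        out[:, t:] = img[:, : w - t]
        mask[:, t:] = 1
    else:
        out[:, :t] = img[:, -t:]
        mask[:, :t] = 1
    return out, mask
```

In training, the masked L2 loss would then be something like `(((g_out - shifted) * mask) ** 2).mean()`, where `g_out` is the generator output at the walked latent code.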
Zoom. We learn a walk which is optimized to zoom in and out up to four times the original image. For zooming in, we crop the central portion of the source image by some αt amount, where 0.25 < αt < 1, and resize it back to its original size. To zoom out, we downsample the image by αt, where 1 < αt < 4. To allow for both a positive and negative walk direction, we set αg = log(αt). Similar to shift, a mask applied during training allows the generator to inpaint the background scene.
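A rough numpy stand-in for this zoom edit, using nearest-neighbor resampling in place of a proper resize (the function name and details are illustrative):

```python
import numpy as np

def edit_zoom(img, alpha_t):
    """Zoom an H x W x C image: alpha_t < 1 crops the central alpha_t
    fraction and resizes it back up (zoom in); alpha_t > 1 downsamples
    by alpha_t onto a zero background (zoom out).

    Nearest-neighbor resampling keeps the sketch dependency-free.
    """
    h, w, c = img.shape
    if alpha_t < 1:   # zoom in: crop center, resize back to (h, w)
        ch, cw = int(h * alpha_t), int(w * alpha_t)
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        crop = img[y0:y0 + ch, x0:x0 + cw]
        yi = np.arange(h) * ch // h
        xi = np.arange(w) * cw // w
        return crop[yi][:, xi]
    else:             # zoom out: shrink onto a zero background
        ch, cw = int(h / alpha_t), int(w / alpha_t)
        yi = np.arange(ch) * h // ch
        xi = np.arange(cw) * w // cw
        small = img[yi][:, xi]
        out = np.zeros_like(img)
        y0, x0 = (h - ch) // 2, (w - cw) // 2
        out[y0:y0 + ch, x0:x0 + cw] = small
        return out
```

As with shift, the zero background region would be masked out of the loss so the generator inpaints it.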
Color. We implement color as a continuous RGB slider, e.g., a 3-tuple αt = (αR, αG, αB), where each of αR, αG, αB can take values in [−0.5, 0.5] during training. To edit the source image, we simply add the corresponding αt values to each of the image channels. Our latent space walk is parameterized as z + αg·w = z + αR·wR + αG·wG + αB·wB, where we jointly learn the three walk directions wR, wG, and wB.
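The color walk itself is just a weighted sum of the three learned directions; with placeholder random vectors standing in for the learned wR, wG, wB, it can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(128)
# Placeholder directions; in the paper these are learned jointly.
w_R, w_G, w_B = rng.standard_normal((3, 128))

def color_walk(z, alpha_R, alpha_G, alpha_B):
    """z + alpha_R * w_R + alpha_G * w_G + alpha_B * w_B"""
    return z + alpha_R * w_R + alpha_G * w_G + alpha_B * w_B

# Slide the red channel up and the green channel down.
z_edit = color_walk(z, 0.5, -0.25, 0.0)
```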
Rotate in 2D. Rotation in 2D is trained in a similar manner as the shift operations, where we train with −45 ≤ αt ≤ 45 degrees of rotation. Using R = 45, we scale αg = αt/R. We use a mask to enforce the loss only on visible regions of the target.
Rotate in 3D. We simulate a 3D rotation using a perspective transformation along the Z-axis, essentially treating the image as a rotating billboard. Similar to the 2D rotation, we train with −45 ≤ αt ≤ 45 degrees of rotation, scale αg = αt/R where R = 45, and apply a mask during training.
A3 Linear NN(z) walk
Rather than defining w as a vector in z space (Eq. 1), one could define it as a function that takes a z as input and maps it to the desired z′ after taking a variable-sized step α in latent space. In this case, we may parametrize the walk with a neural network w = NN(z), and transform the image using G(z + αNN(z)). However, as we show in the following proof, this idea will not learn to let w be a function of z.
Proof. For simplicity, let w = F(z). We optimize for J(w, α) = E_z[L(G(z + αw), edit(G(z), α))], where α is an arbitrary scalar value. Note that for the target image, performing two equal edit operations is equivalent to performing a single edit of twice the size (e.g., shifting by 10px is the same as shifting by 5px twice; zooming by 4x is the same as zooming by 2x twice). That is,
edit(G(z), 2α) = edit(edit(G(z), α), α).
To achieve this target, starting from an initial z, we can take two steps of size α in latent space as follows:
z1 = z + αF(z)
z2 = z1 + αF(z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z1 + αF(z1)    (4)
Therefore,

z + 2αF(z) = z + αF(z) + αF(z1)
⇒ αF(z) = αF(z1)
⇒ F(z) = F(z1)
Thus F(·) simply becomes a linear trajectory that is independent of the input z.
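The composition property that drives this argument can be checked numerically for a concrete edit; here a circular horizontal shift stands in for the edit operation:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16, 3))

def edit(img, alpha):
    # Circular shift by alpha pixels as a stand-in edit operation.
    return np.roll(img, alpha, axis=1)

# Two equal edits compose into a single edit of twice the size.
assert np.array_equal(edit(img, 2 * 5), edit(edit(img, 5), 5))
```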
A4 Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε·dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε·(dz/dt)|_{z_i}. The key difference from A3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
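A sketch of this fixed-step walk, with a toy two-layer tanh network standing in for the learned NN and the magnitude renormalization applied after every step (all weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Toy two-layer network weights standing in for the learned NN(z).
W1 = rng.standard_normal((64, d)) * 0.1
W2 = rng.standard_normal((d, 64)) * 0.1

def nn_step(z):
    return W2 @ np.tanh(W1 @ z)

def nonlinear_walk(z, n_steps):
    """Take n_steps fixed-size steps z <- z + NN(z), renormalizing the
    magnitude of z after each step (Euler-style integration)."""
    norm = np.linalg.norm(z)
    for _ in range(n_steps):
        z = z + nn_step(z)
        z = z * (norm / np.linalg.norm(z))   # renormalize magnitude
    return z

z0 = rng.standard_normal(d)
z4 = nonlinear_walk(z0, 4)   # roughly the full transformation range
```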
B Additional Experiments

B1 Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9, we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B2 Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B3).
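The consecutive-pair measurement can be sketched as follows; plain L2 distance stands in here for the LPIPS metric used in the paper:

```python
import numpy as np

def consecutive_distances(images):
    """Mean distance between images generated at consecutive alpha
    values. The paper uses LPIPS; plain L2 is a lightweight stand-in."""
    return [float(np.linalg.norm(a - b)) for a, b in zip(images, images[1:])]

# Images generated at increasing alpha; a flat tail (distance near 0)
# indicates the transformation has saturated.
rng = np.random.default_rng(0)
imgs = [rng.standard_normal((8, 8)) for _ in range(3)]
imgs.append(imgs[-1].copy())   # saturated: no further change
d = consecutive_distances(imgs)
```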
B3 Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1 object detection model.² In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset, and to images generated by the model under the zoom, horizontal shift, and vertical shift transformations for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
B4 Alternative Walks in BigGAN
B41 LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z[L(G(z + αw), edit(G(z), α))]    (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
B42 Non-linear Walks
Following the approach of A4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text: we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B43 Additional Qualitative Examples
We show qualitative examples for randomly generated categories for BigGAN, using linear-L2, linear LPIPS, and nonlinear trajectories, in Figs. 16, 17, and 18, respectively.
B5 Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B6 Qualitative examples for additional transformations
Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target: for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
[Figure 9: data vs. model PDFs for Pixel Intensity, Center X, Center Y, and Area]
Figure 9. Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift operations (bounding box center), and compare them to the statistics of images in the training dataset.
[Figure 10: six panels (Luminance, Rotate 2D, Shift X, Shift Y, Zoom, Rotate 3D) plotting perceptual distance against α]
Figure 10. LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11. Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12. Bounding boxes for randomly selected classes, using model-generated images under zoom and horizontal and vertical shift transformations with random values of α.
Figure 13. Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14. Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15: panels for Luminance, Shift X, Shift Y, and Zoom showing intersection vs. α, model-vs-data scatterplots of µ (p=0.49, 0.49, 0.55, 0.60), attribute PDFs under varying α, and FID vs. α]
Figure 15. Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), the FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
Figure 16. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18. Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19. Qualitative examples for learned transformations using the StyleGAN car generator.
[Figure 20: panels for Luminance, Shift X, Shift Y, and Zoom showing data vs. model PDFs, intersection vs. α, FID vs. α, and model PDFs under varying α]
Figure 20. Quantitative experiments for learned transformations using the StyleGAN car generator.
Figure 21. Qualitative examples for learned transformations using the StyleGAN cat generator.
[Figure 22: panels for Luminance, Shift X, Shift Y, and Zoom showing data vs. model PDFs, intersection vs. α, FID vs. α, and model PDFs under varying α]
Figure 22. Quantitative experiments for learned transformations using the StyleGAN cat generator.
Figure 23. Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
[Figure 24: panels for Shift X, Shift Y, Zoom, and Luminance showing model PDFs under varying α, FID vs. α, data vs. model PDFs, and intersection vs. α]
Figure 24. Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25. Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26. Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
Appendix
A Method detailsA1 Optimization for the linear walk
We learn the walk vector using mini-batch stochasticgradient descent with the Adam optimizer [15] in tensor-flow trained on 20000 unique samples from the latent spacez We share the vector w across all ImageNet categories forthe BigGAN model
A2 Implementation Details for Linear Walk
We experiment with a number of different transforma-tions learned in the latent space each corresponding to adifferent walk vector Each of these transformations can belearned without any direct supervision simply by applyingour desired edit to the source image Furthermore the pa-rameter α allows us to vary the extent of the transformationWe found that a slight modification to each transformationimproved the degree to which we were able to steer the out-put space we scale α differently for the learned transfor-mation G(z + αgw) and the target edit edit(G(z) αt)We detail each transformation below
Shift We learn transformations corresponding to shiftingan image in the horizontal X direction and the vertical Ydirection We train on source images that are shifted minusαtpixels to the left and αt pixels to the right where we set αtto be between zero and one-half of the source image widthor heightD When training the walk we enforce that the αgparameter ranges between -1 and 1 thus for a random shiftby t pixels we use the value αg = αtD We apply a maskto the shifted image so that we only apply the loss functionon the visible portion of the source image This forces thegenerator to extrapolate on the obscured region of the targetimage
Zoom We learn a walk which is optimized to zoom inand out up to four times the original image For zoomingin we crop the central portion of the source image by someαt amount where 025 lt αt lt 1 and resize it back to itsoriginal size To zoom out we downsample the image byαt where 1 lt αt lt 4 To allow for both a positive andnegative walk direction we set αg = log(αt) Similar toshift a mask applied during training allows the generator toinpaint the background scene
Color We implement color as a continuous RGB slidereg a 3-tuple αt = (αR αG αB) where each αR αG αBcan take values between [minus05 05] in training To edit thesource image we simply add the corresponding αt valuesto each of the image channels Our latent space walk isparameterized as z+αgw = z+αRwR +αGwG +αBwB
where we jointly learn the three walk directions wR wGand wB
Rotate in 2D Rotation in 2D is trained in a similar man-ner as the shift operations where we train withminus45 le αt le45 degree rotation Using R = 45 scale αg = αtR Weuse a mask to enforce the loss only on visible regions of thetarget
Rotate in 3D We simulate a 3D rotation using a perspec-tive transformation along the Z-axis essentially treating theimage as a rotating billboard Similar to the 2D rotationwe train with minus45 le αt le 45 degree rotation we scaleαg = αtR where R = 45 and apply a mask during train-ing
A3 Linear NN(z) walk
Rather than defining w as a vector in z space (Eq 1) onecould define it as a function that takes a z as input and mapsit to the desired zprime after taking a variable-sized step α in la-tent space In this case we may parametrize the walk witha neural network w = NN(z) and transform the image us-ing G(z+αNN(z)) However as we show in the followingproof this idea will not learn to let w be a function of z
Proof For simplicity let w = F (z) We optimize forJ(wα) = Ez [L(G(z + αw)edit(G(z) α))] where αis an arbitrary scalar value Note that for the target imagetwo equal edit operations is equivalent to performing a sin-gle edit of twice the size (eg shifting by 10px the same asshifting by 5px twice zooming by 4x is the same as zoom-ing by 2x twice) That is
edit(G(z) 2α) = edit(edit(G(z) α) α)
To achieve this target starting from an initial z we can taketwo steps of size α in latent space as follows
z1 = z + αF(z)

z2 = z1 + αF(z1)
However, because we let α take on any scalar value during optimization, our objective function enforces that starting from z and taking a step of size 2α equals taking two steps of size α:
z + 2αF(z) = z1 + αF(z1) (4)
Therefore,

z + 2αF(z) = z + αF(z) + αF(z1) ⇒ αF(z) = αF(z1) ⇒ F(z) = F(z1).
Thus, F(·) simply becomes a linear trajectory that is independent of the input z.
A.4. Optimization for the non-linear walk
Given the limitations of the previous walk, we define our nonlinear walk F(z) using discrete step sizes ε. We define F(z) as z + NN(z), where the neural network NN learns a fixed ε-step transformation, rather than a variable α step. We then renormalize the magnitude of z. This approach mimics the Euler method for solving ODEs with a discrete step size, where we assume that the gradient of the transformation in latent space is of the form ε dz/dt = NN(z), and we approximate z_{i+1} = z_i + ε dz/dt |_{z_i}. The key difference from A.3 is the fixed step size, which avoids optimizing for the equality in (4).
We use a two-layer neural network to parametrize the walk, and optimize over 20,000 samples using the Adam optimizer as before. Positive and negative transformation directions are handled with two neural networks having identical architecture but independent weights. We set ε to achieve the same transformation ranges as the linear trajectory within 4–5 steps.
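The fixed-ε Euler-style walk above can be sketched as follows, assuming random stand-in MLP weights (in practice these are trained) and our own function names; note the renormalization back to the magnitude of the input z after each step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 128, 64

# two-layer MLP for one walk direction (illustrative random weights)
W1 = rng.normal(scale=0.1, size=(hidden, dim))
W2 = rng.normal(scale=0.1, size=(dim, hidden))

def nn_step(z):
    """One fixed epsilon-step: z -> z + NN(z), renormalized back to |z|."""
    h = np.maximum(W1 @ z, 0.0)   # ReLU hidden layer
    z_next = z + W2 @ h           # Euler-style update z_{i+1} = z_i + NN(z_i)
    return z_next * (np.linalg.norm(z) / np.linalg.norm(z_next))

def walk(z, n_steps):
    """Apply n discrete steps; a second network (omitted) handles the negative direction."""
    for _ in range(n_steps):
        z = nn_step(z)
    return z
```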
B. Additional Experiments

B.1. Model and Data Distributions
How well does the model distribution of each property match the dataset distribution? If the generated images do not form a good approximation of the dataset variability, we expect that this would also impact our ability to transform generated images. In Fig. 9 we show the attribute distributions of the BigGAN model G(z) compared to samples from the ImageNet dataset. We show corresponding results for StyleGAN and its respective datasets in Sec. B.5. While there is some bias in how well model-generated images approximate the dataset distribution, we hypothesize that additional biases in our transformations come from variability in the training data.
B.2. Quantifying Transformation Limits
We observe that when we increase the transformation magnitude α in latent space, the generated images become unrealistic and the transformation ceases to have further effect. We show this qualitatively in Fig. 3. To quantitatively verify these trends, we can compute the LPIPS perceptual distance of images generated using consecutive pairs of αi and αi+1. For shift and zoom transformations, perceptual distance is larger when α (or log(α) for zoom) is near zero, and decreases as the magnitude of α increases, which indicates that large α magnitudes have a smaller transformation effect and the transformed images appear more similar. On the other hand, color and rotate in 2D/3D exhibit a steady transformation rate as the magnitude of α increases.
Note that this analysis does not tell us how well we achieve the specific transformation, nor whether the latent trajectory deviates from natural-looking images. Rather, it tells us how much we manage to change the image, regardless of the transformation target. To quantify how well each transformation is achieved, we rely on attribute detectors such as object bounding boxes (see B.3).
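The consecutive-pair analysis can be sketched as below; here a plain mean-squared-error stand-in replaces the perceptual metric (the paper uses LPIPS), and the function name is ours.

```python
import numpy as np

def consecutive_distances(images, dist=None):
    """Distances between images generated at consecutive alpha_i and alpha_{i+1}.

    `images` must be ordered by alpha. `dist` defaults to mean squared error here;
    in the paper, the LPIPS perceptual distance is used instead.
    """
    if dist is None:
        dist = lambda a, b: float(np.mean((a - b) ** 2))
    return [dist(images[i], images[i + 1]) for i in range(len(images) - 1)]
```

A flattening of these distances toward zero at large |α| is the signature that the transformation has saturated.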
B.3. Detected Bounding Boxes
To quantify the degree to which we are able to achieve the zoom and shift transformations, we rely on a pre-trained MobileNet-SSD v1² object detection model. In Figs. 11 and 12, we show the results of applying the object detection model to images from the dataset and images generated by the model under the zoom, horizontal shift, and vertical shift transformations, for randomly selected values of α, to qualitatively verify that the object detection boundaries are reasonable. Not all ImageNet images contain recognizable objects, so we only use ImageNet classes containing objects recognizable by the detector for this analysis.
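Given detector boxes, the quantities plotted in the distribution figures (normalized box area for zoom, box center for shift) reduce to simple arithmetic. A sketch with our own function name (the detector itself is not reimplemented here):

```python
def box_stats(box, img_w, img_h):
    """Normalized area and center of a detector box (x1, y1, x2, y2), in pixels.

    Area tracks the zoom transformation; center x/y track horizontal/vertical shift.
    """
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    return area, cx, cy
```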
B.4. Alternative Walks in BigGAN
B.4.1 LPIPS objective
In the main text, we learn the latent space walk w by minimizing the objective function
J(w, α) = E_z [L(G(z + αw), edit(G(z), α))] (5)
using a Euclidean loss for L. In Fig. 13, we show qualitative results using the LPIPS perceptual similarity metric [25] instead of the Euclidean loss. Walks were trained using the same parameters as those in the linear-L2 walk shown in the main text: we use 20k samples for training with the Adam optimizer, and learning rate 0.001 for zoom and color, and 0.0001 for the remaining edit operations (due to the scaling of α).
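To make the optimization in Eq. (5) concrete, here is a toy sketch under strong simplifying assumptions of ours: the generator is replaced by a linear map G(z) = Az, the edit is a pixel-space shift along a direction t constructed to be reachable, and w is fit by stochastic gradient descent on the L2 objective. None of these stand-ins are the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
zdim, xdim = 8, 16
A = rng.normal(size=(xdim, zdim))   # toy linear "generator": G(z) = A z
t = A @ rng.normal(size=zdim)       # toy edit direction (reachable): edit(x, a) = x + a*t

w = np.zeros(zdim)
lr = 0.01
for _ in range(3000):
    z = rng.normal(size=zdim)
    alpha = rng.uniform(-1.0, 1.0)
    # L2 objective from Eq. (5): || G(z + alpha*w) - edit(G(z), alpha) ||^2
    residual = A @ (z + alpha * w) - (A @ z + alpha * t)
    w -= lr * 2.0 * alpha * (A.T @ residual)   # gradient of the squared error w.r.t. w
```

In this linear toy, the learned w converges to the direction whose image under A matches the pixel-space edit, mirroring how the latent walk reproduces the target transformation.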
B.4.2 Non-linear Walks
Following A.4, we modify our objective to use discrete step sizes ε rather than continuous steps. We learn a function F(z) to perform this ε-step transformation on a given latent code z, where F(z) is parametrized with a neural network. We show qualitative results in Fig. 14. We perform the same set of experiments shown in the main text using this nonlinear walk in Fig. 15. These experiments exhibit similar trends as we observed in the main text – we are able to modify the generated distribution of images using latent space walks, and the amount by which we can transform is related to the variability in the dataset. However, there are greater increases in FID when we apply the non-linear transformation, suggesting that these generated images deviate more from natural images and look less realistic.
² https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
B.4.3 Additional Qualitative Examples
We show qualitative examples for randomly selected categories for the BigGAN linear-L2, linear LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5. Walks in StyleGAN
We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, and corresponding quantitative analyses in Figs. 20, 22, and 24.
B.6. Qualitative examples for additional transformations
Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target – for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right), and for combining zoom, shift X, and shift Y transformations (Fig. 26).
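Restricting the objective to a segmented region amounts to masking the reconstruction loss; a minimal sketch, with the function name ours and a plain L2 loss standing in for the training objective:

```python
import numpy as np

def masked_l2(generated, target, seg_mask):
    """L2 loss restricted to segmented pixels (e.g. the car region)."""
    m = seg_mask[..., None].astype(float)         # broadcast (H, W) mask over channels
    n = max(m.sum() * generated.shape[-1], 1.0)   # number of contributing entries
    return float((((generated - target) ** 2) * m).sum() / n)
```

Pixels outside the segmentation contribute nothing to the loss, so the walk is free to leave them unchanged.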
Figure 9. Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset.
[Figure 10 panels: Luminance, Rotate 2D, Shift X, Shift Y, Zoom, Rotate 3D; perceptual distance versus α (or log(α) for zoom).]
Figure 10. LPIPS perceptual distances between images generated from pairs of consecutive αi and αi+1. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude.
Figure 11. Bounding boxes for randomly selected classes, using ImageNet training images.
Figure 12. Bounding boxes for randomly selected classes, using model-generated images under the zoom, horizontal shift, and vertical shift transformations with random values of α.
Figure 13. Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).
Figure 14. Nonlinear walks in BigGAN trained to minimize L2 loss for color, and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
[Figure 15 panels: intersection and FID versus α for Luminance, Shift X, Shift Y, and Zoom; scatterplot correlations p = 0.49, 0.49, 0.55, 0.60, respectively.]
Figure 15. Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), FID scores on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
Figure 16. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective.
Figure 17. Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective.
Figure 18. Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory.
Figure 19. Qualitative examples for learned transformations using the StyleGAN car generator.
[Figure 20 panels: Luminance, Shift X, Shift Y, Zoom; data/model PDFs, intersection, and FID versus α.]
Figure 20. Quantitative experiments for learned transformations using the StyleGAN car generator.
Figure 21. Qualitative examples for learned transformations using the StyleGAN cat generator.
[Figure 22 panels: Luminance, Shift X, Shift Y, Zoom; data/model PDFs, intersection, and FID versus α.]
Figure 22. Quantitative experiments for learned transformations using the StyleGAN cat generator.
Figure 23. Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
[Figure 24 panels: Luminance, Shift X, Shift Y, Zoom; data/model PDFs, intersection, and FID versus α.]
Figure 24. Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values at which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25. Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26. Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
A4 Optimization for the non-linear walk
Given the limitations of the previous walk we define ournonlinear walk F (z) using discrete step sizes ε We defineF (z) as z + NN(z) where the neural network NN learnsa fixed ε step transformation rather than a variable α stepWe then renormalize the magnitude z This approach mim-ics the Euler method for solving ODEs with a discrete stepsize where we assume that the gradient of the transforma-tion in latent space is of the form εdzdt = NN(z) and we ap-proximate zi+1 = zi + εdzdt |zi The key difference from A3is the fixed step size which avoids optimizing for the equal-ity in (4)
We use a two-layer neural network to parametrize thewalk and optimize over 20000 samples using the Adamoptimizer as before Positive and negative transformationdirections are handled with two neural networks havingidentical architecture but independent weights We set ε toachieve the same transformation ranges as the linear trajec-tory within 4-5 steps
B Additional ExperimentsB1 Model and Data Distributions
How well does the model distribution of each propertymatch the dataset distribution If the generated images donot form a good approximation of the dataset variability weexpect that this would also impact our ability to transformgenerated images In Fig 9 we show the attribute distri-butions of the BigGAN model G(z) compared to samplesfrom the ImageNet dataset We show corresponding resultsfor StyleGAN and its respective datasets in Sec B5 Whilethere is some bias in how well model-generated images ap-proximated the dataset distribution we hypothesize that ad-ditional biases in our transformations come from variabilityin the training data
B2 Quantifying Transformation Limits
We observe that when we increase the transformationmagnitude α in latent space the generated images becomeunrealistic and the transformation ceases to have further ef-fect We show this qualitatively in Fig 3 To quantitativelyverify this trends we can compute the LPIPS perceptualdistance of images generated using consecutive pairs of αiand αi+1 For shift and zoom transformations perceptualdistance is larger when α (or log(α) for zoom) is near zeroand decreases as the the magnitude of α increases whichindicates that large α magnitudes have a smaller transfor-mation effect and the transformed images appear more sim-ilar On the other hand color and rotate in 2D3D exhibit asteady transformation rate as the magnitude of α increases
Note that this analysis does not tell us how well weachieve the specific transformation nor whether the latent
trajectory deviates from natural-looking images Rather ittells us how much we manage to change the image regard-less of the transformation target To quantify how well eachtransformation is achieved we rely on attribute detectorssuch as object bounding boxes (see B3)
B3 Detected Bounding Boxes
To quantify the degree to which we are able to achievethe zoom and shift transformations we rely on a pre-trainedMobileNet-SSD v12 object detection model In Fig 11and 12 we show the results of applying the object detectionmodel to images from the dataset and images generatedby the model under the zoom horizontal shift and verti-cal shift transformations for randomly selected values of αto qualitatively verify that the object detection boundariesare reasonable Not all ImageNet images contain recogniz-able objects so we only use ImageNet classes containingobjects recognizable by the detector for this analysis
B4 Alternative Walks in BigGAN
B41 LPIPS objective
In the main text we learn the latent space walk w by mini-mizing the objective function
J(wα) = Ez [L(G(z + αw)edit(G(z) α))] (5)
using a Euclidean loss for L In Fig 13 we show qualitativeresults using the LPIPS perceptual similarity metric [25] in-stead of Euclidean loss Walks were trained using the sameparameters as those in the linear-L2 walk shown in the maintext we use 20k samples for training with Adam optimizerand learning rate 0001 for zoom and color 00001 for theremaining edit operations (due to scaling of α)
B42 Non-linear Walks
Following B42 we modify our objective to use discretestep sizes ε rather than continuous steps We learn a func-tion F (z) to perform this ε-step transformation on given la-tent code z where F (z) is parametrized with a neural net-work We show qualitative results in Fig 14 We performthe same set of experiments shown in the main text usingthis nonlinear walk in Fig 15 These experiments exhibitsimilar trends as we observed in the main text ndash we are ableto modify the generated distribution of images using latentspace walks and the amount to which we can transformis related to the variability in the dataset However thereare greater increases in FID when we apply the non-lineartransformation suggesting that these generated images de-viate more from natural images and look less realistic
2httpsgithubcomopencvopencvwikiTensorFlow-Object-Detection-API
B43 Additional Qualitative Examples
We show qualitative examples for randomly generated cat-egories for BigGAN linear-L2 linear LPIPS and nonlineartrajectories in Figs 16 17 18 respectively
B5 Walks in StyleGAN
We perform similar experiments for linear latent spacewalks using StyleGAN models trained on the LSUN catLSUN car and FFHQ face datasets As suggested by [13]we learn the walk vector in the intermediate W latent spacedue to improved attribute disentanglement in W We showqualitative results for color shift and zoom transformationsin Figs 19 21 23 and corresponding quantitative analyses
in Figs 20 22 24
B6 Qualitative examples for additional transfor-mations
Since the color transformation operates on individualpixels we can optimize the walk using a segmented tar-get ndash for example when learning a walk for cars we onlymodify pixels in segmented car region when generatingedit(G(z) α) StyleGAN is able to roughly localize thecolor transformation to this region suggesting disentangle-ment of different objects within the W latent space (Fig 25left) as also noted in [13 21] We also show qualitative re-sults for adjust image contrast (Fig 25 right) and for com-bining zoom shift X and shift Y transformations (Fig 26)
00 02 04 06 08 10Area
05
10
15
20
PD
F
105
data
model
00 02 04 06 08 10Center Y
0
1
2
PD
F
102
data
model
00 02 04 06 08 10Center X
00
05
10
15
20
PD
F
102
data
model
00 02 04 06 08 10Pixel Intensity
2
3
4
5
PD
F
103
data
model
ZoomShift YShift XLuminance
Figure 9 Comparing model versus dataset distribution We plot statistics of the generated under the color (luminance) zoom(object bounding box size) and shift operations (bounding box center) and compare them to the statistics of images in thetraining dataset
Luminance
Rotate 2D
Shift X Shift Y
Zoom
10 05 00 05 10crarr
00
02
04
06
08
Per
ceptu
alD
ista
nce
50 0 50crarr
00
02
04
06
08
Per
ceptu
alD
ista
nce
400 200 0 200 400crarr
00
02
04
06
08
Per
ceptu
alD
ista
nce
400 200 0 200 400crarr
00
02
04
06
08
Per
ceptu
alD
ista
nce
2 1 0 1 2log(crarr)
00
02
04
06
08
Per
ceptu
alD
ista
nce
Rotate 3D
500 250 0 250 500crarr
00
02
04
06
08
Per
ceptu
alD
ista
nce
Figure 10 LPIPS Perceptual distances between images generated from pairs of consecutive αi and αi+1 We sample 1000images from randomly selected categories using BigGAN transform them according to the learned linear trajectory for eachtransformation We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area) aswell as 20 individual samples (scatterplot) Because the Rotate 3D operation undershoots the targeted transformation weobserve more visible effects when we increase the α magnitude
Figure 11 Bounding boxes for random selected classes using ImageNet training images
Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α
Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)
Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk
10 05 00 05 10crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100In
ters
ection
2 0 2log(crarr)
000
025
050
075
100
Inte
rsec
tion
020 025 030Data
01
02
03
04
05
Model
micro
010 015 020Data
00
02
04
Model
micro
005 010 015Data
01
02
03
04
Model
micro
02 03Data
02
04
06
08
Model
micro
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
10 05 00 05 10crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
2 0 2log(crarr)
20
30
40
50
FID
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25P
DF
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
Luminance Shift X Shift Y Zoom
p=049 p=049 p=055 p=060
Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation
- Color +- Zoom +
- Shift X + - Shift Y +
- Rotate 2D + - Rotate 3D +
Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. Panels: Luminance, Shift X, Shift Y, Zoom. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values at which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
B.4.3 Additional Qualitative Examples

We show qualitative examples for randomly selected categories in BigGAN for the linear-L2, linear-LPIPS, and nonlinear trajectories in Figs. 16, 17, and 18, respectively.
B.5 Walks in StyleGAN

We perform similar experiments for linear latent space walks using StyleGAN models trained on the LSUN cat, LSUN car, and FFHQ face datasets. As suggested by [13], we learn the walk vector in the intermediate W latent space, due to improved attribute disentanglement in W. We show qualitative results for color, shift, and zoom transformations in Figs. 19, 21, and 23, with the corresponding quantitative analyses in Figs. 20, 22, and 24.
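The walk applied in W is simply a vector addition on the intermediate latent. Below is a minimal, self-contained sketch of this step; the `mapping` function is a toy random linear map standing in for StyleGAN's actual mapping network (which is not reproduced here), and the direction `d` stands in for the learned walk vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for StyleGAN's mapping network f: Z -> W.
# The real mapping network is an 8-layer MLP; a fixed random linear map
# keeps this sketch self-contained and runnable.
W_DIM = 512
MAP = rng.standard_normal((W_DIM, W_DIM)) / np.sqrt(W_DIM)

def mapping(z):
    """Map a sampled latent z to the intermediate latent w = f(z)."""
    return np.tanh(z @ MAP)

def walk(w, d, alpha):
    """Linear walk in W: move the intermediate latent along direction d."""
    return w + alpha * d

z = rng.standard_normal(W_DIM)
w = mapping(z)
d = rng.standard_normal(W_DIM)   # stand-in for the learned walk direction
w_edit = walk(w, d, alpha=0.5)   # w_edit would be fed to the synthesis network

# The step taken in W is exactly alpha * d, independent of w itself.
assert np.allclose(w_edit - w, 0.5 * d)
```

In the real model, `w_edit` would then be passed to the synthesis network in place of `w` to render the transformed image.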
B.6 Qualitative examples for additional transformations

Since the color transformation operates on individual pixels, we can optimize the walk using a segmented target: for example, when learning a walk for cars, we only modify pixels in the segmented car region when generating edit(G(z), α). StyleGAN is able to roughly localize the color transformation to this region, suggesting disentanglement of different objects within the W latent space (Fig. 25, left), as also noted in [13, 21]. We also show qualitative results for adjusting image contrast (Fig. 25, right) and for combining the zoom, shift X, and shift Y transformations (Fig. 26).
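The segmented-target objective amounts to restricting the reconstruction loss to the masked region. A minimal sketch with toy arrays; in the setup above, `generated` would be the walked output G(z + αw) and `target` the per-pixel edit edit(G(z), α), with `mask` the car segmentation:

```python
import numpy as np

def masked_l2(generated, target, mask):
    """L2 objective restricted to the segmented region: pixels outside the
    mask (e.g. background around the car) contribute nothing to the loss."""
    diff = (generated - target) * mask
    return float(np.sum(diff ** 2))

# Toy 4x4 single-channel images; the mask selects a 2x2 "car" region.
target = np.zeros((4, 4))
generated = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0

# Only the 4 masked pixels differ, so the loss is 4.0 rather than 16.0.
assert masked_l2(generated, target, mask) == 4.0
```

Optimizing the walk against this loss leaves gradients zero outside the mask, which is why the learned color change concentrates on the segmented object.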
Figure 9: Comparing model versus dataset distribution. We plot statistics of the generated images under the color (luminance), zoom (object bounding box size), and shift (bounding box center) operations, and compare them to the statistics of images in the training dataset. Panels: Pixel Intensity (Luminance), Center X (Shift X), Center Y (Shift Y), Area (Zoom).
Figure 10: LPIPS perceptual distances between images generated from pairs of consecutive values α_i and α_{i+1}. We sample 1000 images from randomly selected categories using BigGAN and transform them according to the learned linear trajectory for each transformation. We plot the mean perceptual distance and one standard deviation across the 1000 samples (shaded area), as well as 20 individual samples (scatterplot). Because the Rotate 3D operation undershoots the targeted transformation, we observe more visible effects when we increase the α magnitude. Panels: Luminance, Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom.
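The quantity plotted here is the distance between images generated at consecutive steps along the walk. A minimal sketch of that pairing logic, with mean squared error standing in for the LPIPS metric (which requires a pretrained network and is not reproduced here):

```python
import numpy as np

def consecutive_distances(images, dist):
    """Distance between images generated at consecutive alpha steps."""
    return [dist(a, b) for a, b in zip(images, images[1:])]

# Toy stand-in for the LPIPS metric: mean squared pixel difference.
mse = lambda a, b: float(np.mean((a - b) ** 2))

alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Fake "generated" images whose content drifts linearly with alpha.
images = [np.full((8, 8), a) for a in alphas]

dists = consecutive_distances(images, mse)
# One distance per consecutive pair (alpha_i, alpha_{i+1}).
assert len(dists) == len(alphas) - 1
```

A flat curve of such distances indicates the walk changes the image at a uniform perceptual rate across α, while a drooping curve indicates the transformation saturates.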
Figure 11: Bounding boxes for randomly selected classes using ImageNet training images.

Figure 12: Bounding boxes for randomly selected classes using model-generated images, for the zoom and horizontal and vertical shift transformations under random values of α.

Figure 13: Linear walks in BigGAN trained to minimize LPIPS loss. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss).

Figure 14: Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transformations. For comparison, we show the same samples as in Fig. 1 (which used a linear walk with L2 loss), replacing the linear walk vector w with a nonlinear walk.
Figure 15: Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α); the intersection area between G(z) and model(α); the FID score on transformed images; and scatterplots relating dataset variability to the extent of model transformation. Panels: Luminance, Shift X, Shift Y, Zoom (p = 0.49, 0.49, 0.55, 0.60).
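The intersection statistic reported here measures the overlap between the attribute histogram of raw samples G(z) and that of transformed samples model(α): 1 means identical distributions, 0 means disjoint support. A minimal numpy sketch with toy bin counts (the bin values are illustrative, not from the paper):

```python
import numpy as np

def histogram_intersection(p, q):
    """Overlap of two histograms after normalizing each to sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

# Toy attribute histograms, e.g. binned luminance of G(z) vs model(alpha).
data_hist = np.array([1.0, 3.0, 4.0, 2.0])
model_hist = np.array([2.0, 2.0, 4.0, 2.0])

score = histogram_intersection(data_hist, model_hist)
assert 0.0 <= score <= 1.0
```

Large α values push the transformed distribution away from the raw one, so this score dropping toward zero indicates the limit of how far the walk can shift the distribution.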
Figure 16: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and L2 objective. Panels: Color, Zoom, Shift X, Shift Y, Rotate 2D, Rotate 3D.
Figure 17: Qualitative examples for randomly selected categories in BigGAN, using the linear trajectory and LPIPS objective. Panels: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.
Figure 18: Qualitative examples for randomly selected categories in BigGAN, using a nonlinear trajectory. Panels: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.
Figure 19: Qualitative examples for learned transformations using the StyleGAN car generator. Panels: Rotate 2D, Rotate 3D, Shift X, Shift Y, Zoom, Color.
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
Figure 11 Bounding boxes for random selected classes using ImageNet training images
Figure 12 Bounding boxes for random selected classes using model-generated images for zoom and horizontal and verticalshift transformations under random values of α
Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)
Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk
10 05 00 05 10crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100In
ters
ection
2 0 2log(crarr)
000
025
050
075
100
Inte
rsec
tion
020 025 030Data
01
02
03
04
05
Model
micro
010 015 020Data
00
02
04
Model
micro
005 010 015Data
01
02
03
04
Model
micro
02 03Data
02
04
06
08
Model
micro
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
10 05 00 05 10crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
2 0 2log(crarr)
20
30
40
50
FID
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25P
DF
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
Luminance Shift X Shift Y Zoom
p=049 p=049 p=055 p=060
Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation
- Color +- Zoom +
- Shift X + - Shift Y +
- Rotate 2D + - Rotate 3D +
Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
Figure 13 Linear walks in BigGAN trained to minimize LPIPS loss For comparison we show the same samples as in Fig 1(which used a linear walk with L2 loss)
Figure 14 Nonlinear walks in BigGAN trained to minimize L2 loss for color and LPIPS loss for the remaining transforma-tions For comparison we show the same samples in Fig 1 (which used a linear walk with L2 loss) replacing the linear walkvector w with a nonlinear walk
10 05 00 05 10crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100
Inte
rsec
tion
200 100 0 100 200crarr
000
025
050
075
100In
ters
ection
2 0 2log(crarr)
000
025
050
075
100
Inte
rsec
tion
020 025 030Data
01
02
03
04
05
Model
micro
010 015 020Data
00
02
04
Model
micro
005 010 015Data
01
02
03
04
Model
micro
02 03Data
02
04
06
08
Model
micro
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
10 05 00 05 10crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
200 100 0 100 200crarr
20
30
40
50
FID
2 0 2log(crarr)
20
30
40
50
FID
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
00 02 04 06 08 10Pixel Intensity
00
05
10
15
20
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
00 02 04 06 08 10Center X
00
05
10
15
20
25P
DF
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Center Y
0
1
2
3
4
PD
F
102
model
crarr=-200
crarr=-150
crarr=-100
crarr=-50
crarr=50
crarr=100
crarr=150
crarr=200
00 02 04 06 08 10Area
00
05
10
15
PD
F
104
model
crarr=00625
crarr=0125
crarr=025
crarr=05
crarr=20
crarr=40
crarr=80
crarr=160
Luminance Shift X Shift Y Zoom
p=049 p=049 p=055 p=060
Figure 15 Quantitative experiments for nonlinear walks in BigGAN We show the attributes of generated images under theraw model outputG(z) compared to the distribution under a learned transformation model(α) the intersection area betweenG(z) and model(α) FID score on transformed images and scatterplots relating dataset variability to the extent of modeltransformation
- Color +- Zoom +
- Shift X + - Shift Y +
- Rotate 2D + - Rotate 3D +
Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
[Figure panels: PDF, FID, and Intersection plots for Luminance, Shift X, Shift Y, and Zoom under varying α]
Figure 24. Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values in which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
[Figure panels: - Color +, - Contrast +]
Figure 25. Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in the left column, and a contrast walk for both BigGAN and StyleGAN in the right column.
Figure 26. Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.
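The figure above contrasts a jointly learned walk with composing separately trained walks. For purely linear walks this composition is just vector addition in latent space, so applying the edits sequentially or jointly gives the same latent. A minimal sketch with made-up direction vectors (the paper's actual directions w are learned per generator):

```python
import random

DIM = 8  # illustrative latent dimensionality, not the real model's

def linear_walk(z, w, alpha):
    """Move a latent z along direction w by step size alpha: z' = z + alpha * w."""
    return [zi + alpha * wi for zi, wi in zip(z, w)]

def combined_walk(z, steps):
    """Apply several (direction, alpha) edits in sequence."""
    for w, alpha in steps:
        z = linear_walk(z, w, alpha)
    return z

random.seed(0)
z = [random.gauss(0, 1) for _ in range(DIM)]
w_zoom = [random.gauss(0, 1) for _ in range(DIM)]   # hypothetical zoom direction
w_shift = [random.gauss(0, 1) for _ in range(DIM)]  # hypothetical shift direction

# Sequential application of the two linear edits...
sequential = combined_walk(z, [(w_zoom, 0.5), (w_shift, -1.0)])
# ...equals the single joint step z + 0.5*w_zoom - 1.0*w_shift.
joint = [zi + 0.5 * a - 1.0 * b for zi, a, b in zip(z, w_zoom, w_shift)]
assert all(abs(s - j) < 1e-9 for s, j in zip(sequential, joint))
```

For nonlinear walks (as in Figure 15) this equivalence no longer holds, which is one motivation for training the combined transformation jointly.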
[Figure panels: Intersection vs α; Data vs Model µ scatterplots; PDFs of Pixel Intensity, Center X, Center Y, and Area under varying α; FID vs α — for Luminance, Shift X, Shift Y, Zoom, with p = 0.49, 0.49, 0.55, 0.60 respectively]
Figure 15. Quantitative experiments for nonlinear walks in BigGAN. We show the attributes of generated images under the raw model output G(z) compared to the distribution under a learned transformation model(α), the intersection area between G(z) and model(α), FID score on transformed images, and scatterplots relating dataset variability to the extent of model transformation.
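The "intersection area" reported above compares the attribute histogram of raw samples G(z) against that of transformed samples model(α). A minimal sketch of this overlap measure (our assumption of the metric: both histograms normalized to sum to 1, overlap is the sum of bin-wise minima, so identical distributions score 1.0 and disjoint ones 0.0):

```python
def histogram(values, bins=20, lo=0.0, hi=1.0):
    """Normalized histogram of attribute values (e.g. luminance in [0, 1])."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi into last bin
        counts[idx] += 1
    total = float(len(values))
    return [c / total for c in counts]

def intersection(p, q):
    """Histogram intersection: sum of bin-wise minima of two normalized histograms."""
    return sum(min(a, b) for a, b in zip(p, q))

# Toy example: a transformed attribute distribution shifted away from the data's.
data_attr = [0.40, 0.45, 0.50, 0.50, 0.55, 0.60]
model_attr = [0.50, 0.55, 0.60, 0.60, 0.65, 0.70]
overlap = intersection(histogram(data_attr), histogram(model_attr))
assert 0.0 <= overlap <= 1.0
```

Larger α steps push model(α) further from the data distribution, so this overlap shrinks, which is the trend the Intersection panels plot against α.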
[Figure panels: - Zoom +, - Shift X +, - Shift Y +, - Rotate 2D +, - Rotate 3D +, - Color +]
Figure 16. Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective.
[Figure panels: - Zoom +, - Shift X +, - Shift Y +, - Rotate 2D +, - Rotate 3D +, - Color +]
Figure 17. Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective.
[Figure panels: - Zoom +, - Shift X +, - Shift Y +, - Rotate 2D +, - Rotate 3D +, - Color +]
Figure 18. Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory.
[Figure panels: - Zoom +, - Shift X +, - Shift Y +, - Rotate 2D +, - Rotate 3D +, - Color +]
Figure 19. Qualitative examples for learned transformations using the StyleGAN car generator.
[Figure panels: PDF, FID, and Intersection plots for Luminance, Shift X, Shift Y, and Zoom under varying α]
Figure 20. Quantitative experiments for learned transformations using the StyleGAN car generator.
[Figure panels: - Zoom +, - Shift X +, - Shift Y +, - Rotate 2D +, - Rotate 3D +, - Color +]
Figure 21. Qualitative examples for learned transformations using the StyleGAN cat generator.
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
- Color +- Zoom +
- Shift X + - Shift Y +
- Rotate 2D + - Rotate 3D +
Figure 16 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and L2 objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 17 Qualitative examples for randomly selected categories in BigGAN using the linear trajectory and LPIPS objective
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 18 Qualitative examples for randomly selected categories in BigGAN using a nonlinear trajectory
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
1 0 1crarr
20
40
FID
00 05 10Luminance
2
3
4
5
PD
F
103
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
25
FID
00 05 10ShiftX
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
20
30
40
FID
00 05 10ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
50
100
FID
00 05 10Zoom
05
10
15
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
000 025 050 075 100ShiftX
00
05
10
15
20
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100ShiftY
0
1
2
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
0
2
4
6
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Luminance Shift X Shift Y Zoom
Figure 22 Quantitative experiments for learned transformations using the StyleGAN cat generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 23 Qualitative examples for learned transformations using the StyleGAN FFHQ face generator
04 06 08ShiftX
0
2
4
6P
DF
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Shift X Shift Y Zoom
1 0 1crarr
50
100
150
FID
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
50
100
FID
04 06 08ShiftX
0
2
4
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
100 0 100crarr
50
100
FID
04 06 08ShiftY
0
1
2
PD
F
102
data
model
100 0 100crarr
00
05
10In
ters
ecti
on
1 0 1log(crarr)
100
200
300
FID
00 05 10Zoom
00
05
10
PD
F
105
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
1 0 1crarr
00
05
10
Inte
rsec
tion
00 05 10Luminance
2
4
PD
F
103
data
model
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance
04 06 08ShiftX
0
2
4
6
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
04 06 08ShiftY
0
2
4
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
20
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
Figure 24 Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator For the zoomoperation not all faces are detectable we plot the distribution as zeros for α values in which no face is detected We use thedlib face detector [14] for bounding box coordinates
- Contrast +- Color +
Figure 25 Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN in left column anda contrast walk for both BigGAN and StyleGAN in the right column
Figure 26 Qualitative examples of a linear walk combining the zoom shift X and shift Y transformations First row showsthe target image second row shows the result of learning a walk for the three transformations jointly and the third row showsresults for combining the separately trained walks Green vertical line denotes image center
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 19 Qualitative examples for learned transformations using the StyleGAN car generator
00 05 10Luminance
05
10
PD
F
102
data
model
1 0 1crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
100 0 100crarr
20
40
FID
00 05 10ShiftX
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
100 0 100crarr
25
50
75
FID
00 05 10ShiftY
00
05
10
PD
F
102
data
model
100 0 100crarr
00
05
10
Inte
rsec
tion
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
1 0 1log(crarr)
0
200F
ID
00 05 10Zoom
0
2
4
6
PD
F
106
data
model
1 0 1log(crarr)
00
05
10
Inte
rsec
tion
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
1 0 1crarr
25
50
75
FID
000 025 050 075 100ShiftY
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Zoom
00
05
10
15
PD
F
105
model
crarr=025
crarr=035
crarr=050
crarr=071
crarr=141
crarr=200
crarr=283
crarr=400
000 025 050 075 100ShiftX
000
025
050
075
100
PD
F
102
model
crarr=-1000
crarr=-750
crarr=-500
crarr=-250
crarr=250
crarr=500
crarr=750
crarr=1000
000 025 050 075 100Luminance
00
05
10
15
PD
F
102
model
crarr=-10
crarr=-075
crarr=-05
crarr=-025
crarr=025
crarr=05
crarr=075
crarr=10
Luminance Shift X Shift Y Zoom
Figure 20 Quantitative experiments for learned transformations using the StyleGAN car generator
- Rotate 2D +
- Shift X +
- Zoom +
- Rotate 3D +
- Shift Y +
- Color +
Figure 21 Qualitative examples for learned transformations using the StyleGAN cat generator
[Figure 22 plots omitted in extraction: FID vs. α, attribute PDFs for data and model, and model–data intersection vs. α, over Luminance, Shift X, Shift Y, and Zoom panels.]

Figure 22: Quantitative experiments for learned transformations using the StyleGAN cat generator.
[Figure 23 image grids omitted in extraction; panels: Zoom, Shift X, Shift Y, Rotate 2D, Rotate 3D, Color.]

Figure 23: Qualitative examples for learned transformations using the StyleGAN FFHQ face generator.
[Figure 24 plots omitted in extraction: FID vs. α, attribute PDFs for data and model, and model–data intersection vs. α, over Luminance, Shift X, Shift Y, and Zoom panels.]

Figure 24: Quantitative experiments for learned transformations using the StyleGAN FFHQ face generator. For the zoom operation, not all faces are detectable; we plot the distribution as zeros for α values at which no face is detected. We use the dlib face detector [14] for bounding box coordinates.
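The zero-filling convention for undetectable faces can be sketched as follows. This is a minimal illustration, not the paper's code: the bounding-box tuple format, the image size, and the choice of box-area fraction as the zoom measurement are all assumptions standing in for the actual dlib detector output.

```python
def zoom_measurement(face_box, image_size):
    """Fraction of the image covered by the detected face box.

    face_box is (left, top, right, bottom) in pixels, or None when the
    detector found no face; in that case we record 0.0, matching the
    zero-filling convention for undetectable faces in Figure 24.
    """
    if face_box is None:
        return 0.0
    left, top, right, bottom = face_box
    area = max(0, right - left) * max(0, bottom - top)
    return area / (image_size * image_size)

# Hypothetical detections for three samples at one alpha value:
# two detected faces and one frame where no face was found.
detections = [(64, 64, 192, 192), None, (32, 32, 224, 224)]
values = [zoom_measurement(box, 256) for box in detections]
# values -> [0.25, 0.0, 0.5625]
```

The histogram of such values at each α then includes an explicit mass at zero rather than silently dropping frames where detection failed.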
Figure 25: Qualitative examples of optimizing for a color walk with a segmented target using StyleGAN (left column), and a contrast walk for both BigGAN and StyleGAN (right column).
Figure 26: Qualitative examples of a linear walk combining the zoom, shift X, and shift Y transformations. The first row shows the target image, the second row shows the result of learning a walk for the three transformations jointly, and the third row shows results for combining the separately trained walks. The green vertical line denotes the image center.