
High-Resolution Multi-Scale Neural Texture Synthesis

Xavier Snelgrove
Toronto, Ontario, Canada
[email protected]

(a) Source texture (b) Synthesized image

Figure 1: High-resolution texture synthesis matching CNN texture statistics at 5 image scales

ABSTRACT
We introduce a novel multi-scale approach for synthesizing high-resolution natural textures using convolutional neural networks trained on image classification tasks. Previous breakthroughs were based on the observation that correlations between features at intermediate layers of the network are a powerful texture representation; however, the fixed receptive field of network neurons limits the maximum size of texture features that can be synthesized.

We show that rather than matching statistical properties at many layers of the CNN, better results can be achieved by matching a small number of network layers but across many scales of a Gaussian pyramid. This leads to qualitatively superior synthesized high-resolution textures.

CCS CONCEPTS
• Computing methodologies → Texturing;

KEYWORDS
texture synthesis, neural networks, Gaussian pyramid

SA '17 Technical Briefs, November 27–30, 2017, Bangkok, Thailand
© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SA '17 Technical Briefs: SIGGRAPH Asia 2017 Technical Briefs, November 27–30, 2017, Bangkok, Thailand, https://doi.org/10.1145/3145749.3149449.

ACM Reference Format:
Xavier Snelgrove. 2017. High-Resolution Multi-Scale Neural Texture Synthesis. In SA '17 Technical Briefs: SIGGRAPH Asia 2017 Technical Briefs, November 27–30, 2017, Bangkok, Thailand. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3145749.3149449

1 INTRODUCTION
There have been recent significant improvements in the quality of example-based texture synthesis techniques by taking advantage of intermediate representations in a convolutional neural network (CNN) trained to classify images [Gatys et al. 2015b].

Correlations between feature maps at these intermediate layers, represented by a Gram matrix, turn out to be a powerful representation of texture. By synthesizing new images whose Gram matrix is close to that of an exemplar image (for instance via gradient descent), we get images with similar texture.

However, this only works well when the semantically significant features in the image are at the correct scale for the network, and in practice the receptive field of a feature at an intermediate layer for common CNN architectures is relatively small [Luo et al. 2016]. The popular VGG architectures from Simonyan and Zisserman [2014], used by Gatys et al. and others, are trained on 224 × 224 pixel images, in which relevant features will be quite a bit smaller.

So, given a high-resolution source image, for optimal results an artist must scale down that image until the pixel scale of the features of interest matches the receptive field of the appropriate semantic layer of the network. This limits the resolution of the rendered image, and further breaks down for source images with


textures at multiple scales. The artist must choose to capture one scale of texture at the expense of the other.

In this work we propose a multi-scale neural texture synthesis approach, in which the optimization simultaneously matches textures at every layer of a Gaussian pyramid [Adelson et al. 1984]. The artist no longer needs to trade off between image resolution and texture fidelity, and further can synthesize textures with statistics at multiple scales. For neural texture synthesis this issue has been tackled by exploiting side effects of the optimization process [Gatys et al. 2016b]; however, despite the long history of multi-scale pyramid methods in texture synthesis [Han et al. 2008; Lefebvre and Hoppe 2005; Portilla and Simoncelli 2000], as far as the author knows this is the first work using these in the context of neural texture synthesis.

2 GRAM MATRIX TEXTURE SYNTHESIS
Gatys et al. [2015a] introduced the idea of using the Gram matrix as a spatially invariant representation of feature correlations, which has been used in many works since [Gatys et al. 2016a; Johnson et al. 2016; Saito et al. 2016; Sendik and Cohen-Or 2017].

Their approach is as follows: they take a vectorized source image $\vec{x}$, and feed it into a CNN (in their case, VGG-19 [Simonyan and Zisserman 2014]), computing the activations at each layer of the network.

Defining the number of feature maps for layer $l$ as $N_l$, with vectorized size $M_l$, they define the feature matrix for layer $l$ as $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{jk}$ gives the activation at layer $l$ of the $j$th feature in position $k$.

They define the Gram matrix at layer $l$ as $G^l \in \mathbb{R}^{N_l \times N_l}$, the inner product between the vectorized feature maps $i$ and $j$ in layer $l$, summing across every neuron in those maps. We slightly modify their definition to normalize the representation:

$$G^l_{ij} = \frac{1}{M_l N_l} \sum_k F^l_{ik} F^l_{jk} \qquad (1)$$
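As an aside for implementers, Equation (1) reduces to a single matrix product. The following is a minimal NumPy sketch, not part of the original method description; the layout of `features` is assumed from the definitions above:

    import numpy as np

    def gram_matrix(features):
        """Normalized Gram matrix of Equation (1).

        `features` is an (N_l, M_l) array: N_l feature maps from CNN
        layer l, each flattened to M_l spatial positions.
        """
        n_l, m_l = features.shape
        # Inner products between all pairs of feature maps, with the
        # 1/(M_l N_l) normalization of Equation (1).
        return (features @ features.T) / (m_l * n_l)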

This Gram matrix parameterizes a texture at a particular layer of the CNN. Entry $G_{ij}$ is proportional to how often feature $i$ and feature $j$ are active in the same spatial location. We can determine the difference between the same layer of a source image and a synthesized image by taking the squared Frobenius norm of the difference between the source image's Gram matrix $G^l$ and the synthesized image's Gram matrix $\hat{G}^l$:

$$E_l = \sum_{i,j} \left( G^l_{ij} - \hat{G}^l_{ij} \right)^2 \qquad (2)$$

In order to synthesize a new image, Gatys et al. directly minimize the sum of these terms across multiple layers:

$$\mathcal{L}(\vec{x}, \hat{\vec{x}}) = \sum_{l=0}^{L} w_l E_l \qquad (3)$$

with $w_l$ a selectable weighting factor. They use the L-BFGS optimizer [Zhu et al. 1997] with analytic gradients, since every operation is differentiable.
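Continuing the sketch above, Equations (2) and (3) combine into a few lines; `weights` corresponds to $w_l$, and the feature lists are assumed to come from a forward pass of the CNN, which is omitted here:

    def texture_loss(source_feats, synth_feats, weights):
        """Weighted sum of Gram matrix distances, Equations (2)-(3).

        `source_feats` and `synth_feats` are lists of (N_l, M_l)
        feature matrices, one per matched CNN layer.
        """
        loss = 0.0
        for f_src, f_syn, w in zip(source_feats, synth_feats, weights):
            diff = gram_matrix(f_src) - gram_matrix(f_syn)
            loss += w * np.sum(diff ** 2)  # squared Frobenius norm E_l
        return loss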

3 MULTI-SCALE TEXTURE SYNTHESIS
Natural textures may contain regular structure at multiple scales. Portilla and Simoncelli [2000] showed very good texture synthesis with a bank of multi-scale linear filters called a "steerable pyramid".

Gatys' work can be seen as an extension of (and indeed credits) Portilla and Simoncelli's, but instead of a multi-scale linear filter bank it uses a non-linear multi-scale filter bank given by the CNN. This is one reason why Equation 3 sums across many layers of the CNN.

However, even at the highest layers of the CNN the effective receptive field is relatively small [Luo et al. 2016]. What's more, the degree of abstraction increases with higher levels of the CNN, so it is not a purely multi-scale representation but a multi-scale multi-semantic representation.

All of this means that Gatys' approach is not suited to high-resolution images, which are effectively images with long-range spatial dependencies between pixels [Berger and Memisevic 2016].

We propose resolving this by disentangling the semantic and spatial terms in the optimization. We pick only one or two intermediate layers of VGG-19, but we feed in many layers of a Gaussian pyramid [Adelson et al. 1984]. Each layer of a Gaussian pyramid is formed by blurring and downsampling the previous layer.
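Concretely, one octave step is just a blur followed by dropping every other pixel. A minimal sketch using SciPy follows; the blur width `sigma` is an illustrative choice, not a value taken from this paper:

    from scipy.ndimage import gaussian_filter

    def gaussian_pyramid(image, n_octaves=5, sigma=1.0):
        """List of pyramid octaves for an (H, W, C) image array."""
        octaves = [image]
        for _ in range(n_octaves - 1):
            # Blur spatially (not across channels), then downsample 2x.
            blurred = gaussian_filter(octaves[-1], sigma=(sigma, sigma, 0))
            octaves.append(blurred[::2, ::2])
        return octaves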

We can now modify the equations of Section 2 to incorporate scale. Let the feature matrix for the $s$th scale octave of the Gaussian pyramid and $l$th CNN layer be $F^{l,s}_{ik}$, and the corresponding Gram matrix be

$$G^{l,s}_{ij} = \frac{1}{M_l N_l} \sum_k F^{l,s}_{ik} F^{l,s}_{jk} \qquad (4)$$

We can now modify Equations 2 and 3 to include terms for both layers and scales.

$$E^s_l = \sum_{i,j} \left( G^{l,s}_{ij} - \hat{G}^{l,s}_{ij} \right)^2 \qquad (5)$$

and

$$\mathcal{L}(\vec{x}, \hat{\vec{x}}) = \sum_{s=0}^{S-1} v_s \sum_{l=0}^{L-1} w_l E^s_l \qquad (6)$$

where $S$ is the number of octaves. In practice we only set $v_s$ (octave weights) and $w_l$ (CNN layer weights) to values of either 0 or 1, uniformly weighting all layers and scales of interest. Refer to Figure 2 for a block diagram of our system.
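Putting the sketches together, Equation (6) evaluates the single-scale loss of Section 2 on every pyramid octave. Here `extract_features` is a hypothetical helper standing in for a forward pass through the chosen VGG-19 layers:

    def multiscale_texture_loss(source, synth, n_octaves, v, w,
                                extract_features):
        """Equation (6): octave-weighted sum of per-scale texture losses."""
        src_pyr = gaussian_pyramid(source, n_octaves)
        syn_pyr = gaussian_pyramid(synth, n_octaves)
        loss = 0.0
        for s in range(n_octaves):
            # E^s_l terms for this octave, weighted by v_s.
            loss += v[s] * texture_loss(extract_features(src_pyr[s]),
                                        extract_features(syn_pyr[s]), w)
        return loss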

4 EXPERIMENTAL RESULTS
For our experiments we use the VGG-19 architecture, but replace the max-pooling layers with average-pooling layers, following Gatys et al., and further use their re-scaled VGG-19 weights¹, which normalize the average activation of every feature map across images to 1.

¹Available at https://github.com/leongatys/DeepTextures

We use valid-mode convolution for all convolutional layers, which is important for boundary effects, especially when synthesizing images of a different scale than the source image (so with a different ratio of boundary to internal pixels).
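For illustration, both architecture changes are one-line substitutions when the network is rebuilt with the Keras functional API. This hypothetical sketch shows only VGG-19's first block, without the pretrained weights, not the full network:

    from keras.layers import Input, Conv2D, AveragePooling2D
    from keras.models import Model

    inp = Input(shape=(None, None, 3))  # accept any image resolution
    x = Conv2D(64, (3, 3), activation='relu', padding='valid')(inp)
    x = Conv2D(64, (3, 3), activation='relu', padding='valid')(x)
    x = AveragePooling2D((2, 2), strides=(2, 2))(x)  # instead of max pooling
    model = Model(inp, x)  # features at block1_pool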

We find good results in general using 5 octaves in our Gaussian pyramid, and matching simultaneously against layers block1_pool and block3_conv2 (the 3rd and 8th layers).


Figure 2: Block diagram of our system. The left-hand side shows the source image, with Gram matrices of feature correlations being extracted for VGG-19 layer block3_conv2 across 4 distinct spatial scales. On the right is the current state of the optimization $\hat{\vec{x}}$, whose feature correlation matrices are extracted in the same fashion. The sum of the squared differences of these matrices corresponds to the loss function $\mathcal{L}$, which is minimized by gradient descent on $\hat{\vec{x}}$.

Using both a low-level layer and an intermediate layer sometimes allows the optimization to escape local optima that it would get stuck in using only the higher-level layer.

Figure 3 shows an example result. The source texture in 3a shows texture at multiple scales. At the macro scale it is dominated by regions of blue paint and rusty-red wall, with red streaks in the blue regions. At the micro scale we see that the red region has a rough texture, whereas the blue region has a smooth texture, with a distinct rounded "paint-chipping" shape at the border between the regions. All of these effects are captured in the synthesized image 3b. Using the method of Gatys et al. we find that some of the micro-scale properties are captured, but the macro-scale red/blue alternation is almost completely missing. Further, we can see strong artefacts around the bright blue point in the corner. Refer to Figure 4 for many more results.

Experiments were performed using the Keras [Chollet et al. 2015] neural networks framework. We used the SciPy [Jones et al. 2001] implementation of L-BFGS-B [Zhu et al. 1997]. The code for these experiments is available online².

²http://github.com/wxs/subjective-functions
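The optimization loop itself is small. A minimal sketch with SciPy follows; `loss_and_grad` is a hypothetical callable (e.g. compiled from the network with Keras) returning the scalar loss of Equation (6) and its gradient with respect to the flattened pixels, and the [0, 1] pixel bounds are an illustrative assumption:

    import numpy as np
    from scipy.optimize import fmin_l_bfgs_b

    def synthesize(loss_and_grad, shape, max_evals=500):
        """Minimize the texture loss over image pixels with L-BFGS-B."""
        x0 = np.random.uniform(0.0, 1.0, size=int(np.prod(shape)))
        # fmin_l_bfgs_b expects func to return (loss, gradient).
        x_opt, final_loss, _ = fmin_l_bfgs_b(
            loss_and_grad, x0,
            bounds=[(0.0, 1.0)] * x0.size,
            maxfun=max_evals)
        return x_opt.reshape(shape), final_loss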

5 DISCUSSION
This work has shown the power of using multi-scale representations in conjunction with neural texture synthesis. Gatys et al. can be seen as a special case of this work with a single scale octave ($S = 1$). As such, much of the work building on their approach would also be applicable to ours.

A natural extension would be to attempt multi-scale style transfer, following [Gatys et al. 2015a] but using our objective function to represent the "style loss".

We use a simple optimization process to synthesize our images. Other work has shown that feed-forward neural networks can be trained to approximate this optimization [Johnson et al. 2016], so it would be interesting to attempt those methods with our multi-scale objective. It is likely that some similar multi-scale approach would be required for the architecture of that network, as otherwise the same receptive-field issues would apply to that feed-forward network.

Finally, combining our work with other approaches for finding larger structures in images, such as by finding symmetries as in [Berger and Memisevic 2016; Sendik and Cohen-Or 2017], would also be interesting.


(a) Source image (b) Our method and detail (c) [Gatys et al. 2015b]

Figure 3: Our results on a photograph with texture at multiple scales. Zoom in to see the small-scale surface texture within the red and blue areas. Photo credit: https://commons.wikimedia.org/wiki/File:Detail_of_rusty_van.JPG

ACKNOWLEDGMENTS
We would like to thank Laura Lewis, Martin Snelgrove, Claire Freeman-Fawcett, and Severin Smith for useful discussions and for reading drafts of this work.

Figure 4: Comparing our results to Gatys et al. [2015b] on a variety of source images with texture at multiple scales. Generated images are high resolution at 1024 × 1024 pixels, so best viewed zoomed-in on screen. Our results match against both layers block1_pool and block3_conv2 at 5 scale octaves. Gatys et al. matches layers conv1_1, block1_pool, block2_pool, block3_pool, block4_pool. Note that source textures are slightly cropped for display here. Photos all Creative Commons licensed on Wikimedia Commons by users: Halei Laihaweadu, MikeLynch, Vik Nanda, Brocken Inaglory, Jacopo Werther.

REFERENCES
E. H. Adelson, C. H. Anderson, and J. R. Bergen. 1984. Pyramid methods in image processing. RCA Engineer 29, 6 (1984), 33–41.
G. Berger and R. Memisevic. 2016. Incorporating long-range consistency in CNN-based texture generation. (June 2016). arXiv:1606.01286
François Chollet et al. 2015. Keras. https://github.com/fchollet/keras. (2015).
Leon A. Gatys, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. 2016a. Preserving Color in Neural Artistic Style Transfer. (June 2016). arXiv:1606.05897
Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015a. A Neural Algorithm of Artistic Style. (Aug. 2015). arXiv:1508.06576
Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015b. Texture Synthesis Using Convolutional Neural Networks. In NIPS (2015).
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. 2016b. Controlling Perceptual Factors in Neural Style Transfer. (Nov. 2016). arXiv:1611.07865
Charles Han, Eric Risser, Ravi Ramamoorthi, and Eitan Grinspun. 2008. Multiscale texture synthesis. ACM Transactions on Graphics 27, 3 (2008), 1.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Computer Vision – ECCV 2016. Springer, Cham, 694–711.
Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001. SciPy: Open source scientific tools for Python. http://www.scipy.org/ [Online; accessed May 23, 2017].
Sylvain Lefebvre and Hugues Hoppe. 2005. Parallel controllable texture synthesis. ACM Transactions on Graphics 24, 3 (July 2005), 777–786.
Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. 2016. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In NIPS (2016).
Javier Portilla and Eero P. Simoncelli. 2000. A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients. International Journal of Computer Vision 40, 1 (2000), 49–70.
Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2016. Photorealistic Facial Texture Inference Using Deep Neural Networks. (Dec. 2016). arXiv:1612.00523
Omry Sendik and Daniel Cohen-Or. 2017. Deep Correlations for Texture Synthesis. ACM Transactions on Graphics (April 2017), 1–15.
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. (Sept. 2014). arXiv:1409.1556
Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. 1997. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software 23, 4 (Dec. 1997), 550–560.