
Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model
Erik Nijkamp, Mitch Hill, Song-Chun Zhu, Ying Nian Wu

University of California, Los Angeles

Abstract

This paper studies a curious phenomenon in learning an energy-based model (EBM) using MCMC. In each learning iteration, we generate synthesized examples by running a non-convergent, non-mixing, and non-persistent short-run MCMC toward the current model, always starting from the same initial distribution, such as the uniform noise distribution, and always running a fixed number of MCMC steps. After generating synthesized examples, we update the model parameters according to the maximum likelihood learning gradient, as if the synthesized examples were fair samples from the current model. We treat this non-convergent short-run MCMC as a learned generator model or flow model, and we provide arguments for treating it as a valid model. We show that the learned short-run MCMC is capable of generating realistic images. More interestingly, unlike a traditional EBM or MCMC, the learned short-run MCMC is capable of reconstructing observed images and interpolating between images, like generator or flow models.

Maximum Likelihood Learning of EBM

Probability Density

Let x be the signal, such as an image. The energy-based model (EBM) is a Gibbs distribution
$$p_\theta(x) = \frac{1}{Z(\theta)} \exp(f_\theta(x)), \quad (1)$$
where we assume x is within a bounded range. fθ(x) is the negative energy and is parametrized by a bottom-up convolutional neural network (ConvNet) with weights θ. $Z(\theta) = \int \exp(f_\theta(x))\,dx$ is the normalizing constant.
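For intuition, the normalization in (1) can be carried out numerically in one dimension. The quartic negative energy below is an assumed toy choice (not from the paper), and the bounded range is discretized on a grid:

```python
import numpy as np

# Toy 1-D negative energy f(x) = -x^4 on the bounded range [-3, 3].
xs = np.linspace(-3.0, 3.0, 10001)
dx = xs[1] - xs[0]
unnormalized = np.exp(-xs**4)   # exp(f_theta(x))
Z = unnormalized.sum() * dx     # Riemann approximation of Z(theta)
p = unnormalized / Z            # p_theta(x) on the grid; integrates to 1
```

In higher dimensions this integral is intractable, which is exactly why MCMC sampling is needed for learning.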

Analysis by Synthesis

Suppose we observe training examples xi ∼ pdata, i = 1, ..., n, where pdata is the data distribution. For large n, the sample average over {xi} approximates the expectation with respect to pdata. The log-likelihood is
$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \doteq E_{p_{\mathrm{data}}}[\log p_\theta(x)]. \quad (2)$$

The derivative of the log-likelihood is
$$L'(\theta) = E_{p_{\mathrm{data}}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - E_{p_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \doteq \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta} f_\theta(x_i^-), \quad (3)$$
where xi− ∼ pθ(x) for i = 1, ..., n are the generated examples from the current model pθ(x).
The above equation leads to the “analysis by synthesis” learning algorithm. At iteration t, let θt be the current model parameters. We generate xi− ∼ pθt(x) for i = 1, ..., n. Then we update θt+1 = θt + ηt L′(θt), where ηt is the learning rate.

Short-Run MCMC

Sampling by Langevin Dynamics

Generating synthesized examples xi− ∼ pθ(x) requires MCMC, such as Langevin dynamics, which iterates
$$x_{\tau+\Delta\tau} = x_\tau + \frac{\Delta\tau}{2} f'_\theta(x_\tau) + \sqrt{\Delta\tau}\, U_\tau, \quad (4)$$
where τ indexes the time, Δτ is the discretization of time, and Uτ ∼ N(0, I) is the Gaussian noise term.
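As a concrete illustration, the Langevin update (4) takes a few lines of NumPy. The quadratic negative energy here is a stand-in whose Gibbs density is the standard Gaussian; it is not the paper's ConvNet energy:

```python
import numpy as np

# Hypothetical negative energy f(x) = -||x||^2 / 2, so grad f(x) = -x;
# the corresponding Gibbs density exp(f(x))/Z is the standard Gaussian N(0, I).
def grad_f(x):
    return -x

def langevin_step(x, dt, rng):
    """One update of (4): x <- x + (dt/2) f'(x) + sqrt(dt) U, with U ~ N(0, I)."""
    return x + 0.5 * dt * grad_f(x) + np.sqrt(dt) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(1000, 2))  # chains started from uniform noise p0
for _ in range(100):                        # K = 100 short-run steps
    x = langevin_step(x, dt=0.1, rng=rng)
# For this toy energy the chains end up approximately distributed as N(0, I).
```

For a multi-modal ConvNet energy, the same loop would leave the chains trapped near local modes, which motivates the short-run view below.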

Guidance by Energy-Based Model

If fθ(x) is multi-modal, then different chains tend to get trapped in different local modes, and they do not mix. We propose to give up sampling from pθ. Instead, we run a fixed number, say K, of MCMC steps toward pθ, starting from a fixed initial distribution p0, such as the uniform noise distribution. Let Mθ be the K-step MCMC transition kernel. Define
$$q_\theta(x) = (M_\theta p_0)(x) = \int p_0(z) M_\theta(x \mid z)\, dz, \quad (5)$$
which is the marginal distribution of the sample x after running K-step MCMC from p0.
Instead of learning pθ, we treat qθ as the target of learning. After learning, we keep qθ, but we discard pθ. That is, the sole purpose of pθ is to guide a K-step MCMC from p0.

Learning Short-Run MCMC

The learning algorithm is as follows. Initialize θ0. At learning iteration t, let θt be the model parameters. We generate xi− ∼ qθt(x) for i = 1, ..., m. Then we update θt+1 = θt + ηt Δ(θt), where
$$\Delta(\theta) = E_{p_{\mathrm{data}}}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] - E_{q_\theta}\!\left[\frac{\partial}{\partial\theta} f_\theta(x)\right] \approx \frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial\theta} f_\theta(x_i) - \frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial\theta} f_\theta(x_i^-). \quad (6)$$
The learning procedure is simple. The key to the algorithm is that the generated {xi−} are independent and fair samples from the model qθ.

Algorithm 1: Learning short-run MCMC.
input: Negative energy fθ(x), training steps T, initial weights θ0, observed examples {xi}i=1..n, batch size m, variance of noise σ², Langevin discretization Δτ and steps K, learning rate η.
output: Weights θT+1.
for t = 0 : T do
1. Draw observed images {xi}i=1..m.
2. Draw initial negative examples {xi−}i=1..m ∼ p0.
3. Update observed examples xi ← xi + εi, where εi ∼ N(0, σ²I).
4. Update negative examples {xi−}i=1..m by K steps of Langevin dynamics (4).
5. Update θt by θt+1 = θt + g(Δ(θt), η, t), where the gradient Δ(θt) is (6) and g is ADAM.
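To make the loop concrete, here is a minimal NumPy sketch of Algorithm 1 on a 1-D toy model. The energy fθ(x) = θx − x²/2 with a single learnable weight θ, the synthetic data distribution, and all step sizes are assumptions for illustration only; the paper uses a ConvNet energy, minibatch noise, and ADAM:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=(10000,))   # toy "observed examples"

def short_run_negatives(theta, m, K=30, dt=0.05):
    """Steps 2 and 4: K non-convergent Langevin updates (4) from fixed p0."""
    x = rng.uniform(-2.0, 2.0, size=(m,))    # initial negatives drawn from p0
    for _ in range(K):
        grad = theta - x                     # d/dx f_theta(x) for f = theta*x - x^2/2
        x = x + 0.5 * dt * grad + np.sqrt(dt) * rng.standard_normal(m)
    return x

theta, lr = 0.0, 0.05
for t in range(300):
    xs = rng.choice(data, size=256)          # step 1: observed minibatch
    xn = short_run_negatives(theta, m=256)   # negatives ~ q_theta
    theta += lr * (xs.mean() - xn.mean())    # step 5 with gradient (6), h(x) = x
```

At convergence, the mean of the negatives matches the data mean, even though the MCMC never converges to pθ. Consistently with the moment-matching view below, the learned θ differs from the MLE of pθ, because it is qθ, not pθ, that is fitted to the data.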

Relation to Moment Matching Estimator

We may interpret short-run MCMC as a moment matching estimator. We outline the case of learning the top-layer filters of a ConvNet:

• Consider fθ(x) = ⟨θ, h(x)⟩, where h(x) are the top-layer filter responses of a pretrained ConvNet with top-layer weights θ.
• For such fθ(x), we have $\frac{\partial}{\partial\theta} f_\theta(x) = h(x)$.
• The MLE estimator of pθ is a moment matching estimator, i.e., $E_{p_{\hat\theta_{\mathrm{MLE}}}}[h(x)] = E_{p_{\mathrm{data}}}[h(x)]$.
• If we use the short-run MCMC learning algorithm, it will converge (assuming convergence is attainable) to a moment matching estimator, i.e., $E_{q_{\hat\theta_{\mathrm{MME}}}}[h(x)] = E_{p_{\mathrm{data}}}[h(x)]$.
• Thus, the learned model $q_{\hat\theta_{\mathrm{MME}}}(x)$ is a valid estimator in that it matches the data distribution in terms of the sufficient statistics defined by the EBM.
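The moment-matching property follows directly from setting the update (6) to zero at a fixed point, using ∂fθ(x)/∂θ = h(x):

```latex
\Delta(\hat\theta_{\mathrm{MME}})
  = E_{p_{\mathrm{data}}}[h(x)] - E_{q_{\hat\theta_{\mathrm{MME}}}}[h(x)] = 0
\quad\Longrightarrow\quad
E_{q_{\hat\theta_{\mathrm{MME}}}}[h(x)] = E_{p_{\mathrm{data}}}[h(x)].
```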

Figure 1: The blue curve illustrates the model distributions corresponding to different values of the parameter θ. The black curve illustrates all the distributions that match pdata (black dot) in terms of E[h(x)]. The MLE $p_{\hat\theta_{\mathrm{MLE}}}$ (green dot) is the intersection between Θ (blue curve) and Ω (black curve). The MCMC (red dotted line) starts from p0 (hollow blue dot) and runs toward $p_{\hat\theta_{\mathrm{MME}}}$ (hollow red dot), but the MCMC stops after K steps, reaching $q_{\hat\theta_{\mathrm{MME}}}$ (red dot), which is the learned short-run MCMC.

Relation to Generator Model

We may consider qθ(x) to be a generative model,
$$z \sim p_0(z); \quad x = M_\theta(z, u), \quad (7)$$
where u denotes all the randomness in the short-run MCMC. For the K-step Langevin dynamics, Mθ can be considered a K-layer noise-injected residual network. z can be considered the latent variables, and p0 the prior distribution of z. Due to the non-convergence and non-mixing, x can be highly dependent on z, and z can be inferred from x.

Interpolation

We can perform interpolation as follows. Generate z1 and z2 from p0(z). Let $z_\rho = \rho z_1 + \sqrt{1-\rho^2}\, z_2$. This interpolation keeps the marginal variance of zρ fixed. Let xρ = Mθ(zρ). Then xρ is the interpolation of x1 = Mθ(z1) and x2 = Mθ(z2). Figure 3 displays xρ for a sequence of ρ ∈ [0, 1].

Reconstruction

For an observed image x, we can reconstruct x by running gradient descent on the least squares loss $L(z) = \|x - M_\theta(z)\|^2$, initializing from z0 ∼ p0(z) and iterating zt+1 = zt − ηt L′(zt). Figure 4 displays the sequence of xt = Mθ(zt).
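The claim that zρ keeps the marginal variance fixed is easy to verify numerically. A small sketch with standard Gaussian noise (independent z1 and z2 are assumed, as when both are drawn from p0):

```python
import numpy as np

rng = np.random.default_rng(0)
z1 = rng.standard_normal(100000)
z2 = rng.standard_normal(100000)

variances = []
for rho in (0.0, 0.3, 0.7, 1.0):
    z_rho = rho * z1 + np.sqrt(1.0 - rho**2) * z2
    # Var(z_rho) = rho^2 * Var(z1) + (1 - rho^2) * Var(z2) = 1 for independent z1, z2
    variances.append(z_rho.var())
```

A plain linear interpolation ρ z1 + (1 − ρ) z2 would instead shrink the variance toward the middle of the path, taking zρ off the prior's typical set.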

Capability 1: Synthesis

Figure 2: Generating synthesized examples by running 100 steps of Langevin dynamics initialized from uniform noise for CelebA (64 × 64).

Capability 2: Interpolation

Figure 3: Mθ(zρ) with interpolated noise $z_\rho = \rho z_1 + \sqrt{1-\rho^2}\, z_2$, where ρ ∈ [0, 1], on CelebA (64 × 64). Left: Mθ(z1). Right: Mθ(z2).

Capability 3: Reconstruction

Figure 4: Mθ(zt) over time t, from random initialization (t = 0) to reconstruction (t = 200), on CelebA. Left: random initialization. Right: observed examples.
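The reconstruction loop is plain gradient descent on L(z) = ‖x − Mθ(z)‖². The sketch below stands in a hypothetical linear map for the learned (stochastic) Mθ, since only the optimization loop is being illustrated; the matrix A, the dimensions, and the step size are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))   # hypothetical stand-in generator: M(z) = A z
z_true = rng.standard_normal(4)
x = A @ z_true                    # "observed image" lies in the range of M

z = rng.standard_normal(4)        # initialize z_0 ~ p_0(z)
eta = 0.01
for t in range(10000):
    residual = x - A @ z
    grad = -2.0 * A.T @ residual  # L'(z) for L(z) = ||x - A z||^2
    z = z - eta * grad            # z_{t+1} = z_t - eta * L'(z_t)
```

For the real Mθ, the gradient L′(z) is obtained by backpropagating through the K unrolled Langevin steps with the injected noise u held fixed.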

Conclusion

(1) We propose to shift the focus from convergent MCMC toward efficient, non-convergent, non-mixing, short-run MCMC guided by an EBM.
(2) We interpret short-run MCMC as a moment matching estimator and explore its relations to residual networks and generator models.
(3) We demonstrate the abilities of interpolation and reconstruction due to non-mixing MCMC, which go far beyond the capacity of convergent MCMC.

References

• J Xie*, Y Lu*, SC Zhu, YN Wu. A Theory of Generative ConvNet. ICML 2016.
• R Gao*, Y Lu, J Zhou, SC Zhu, YN Wu. Learning Generative ConvNets via Multigrid Modeling and Sampling. CVPR 2018.
• E Nijkamp*, M Hill*, SC Zhu, YN Wu. On the Anatomy of MCMC-based Maximum Likelihood Learning of Energy-Based Models. AAAI 2020.