
Entropy, Free Energy, and Work of Restricted Boltzmann Machines

Sangchul Oh,1,* Abdelkader Baggag,2,† and Hyunchul Nha3,‡

1 Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, Qatar Foundation, P.O. Box 5825, Doha, Qatar
2 Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar Foundation, P.O. Box 5825, Doha, Qatar

3 Department of Physics, Texas A&M University at Qatar, Education City, P.O. Box 23874, Doha, Qatar
(Dated: April 13, 2020)

A restricted Boltzmann machine is a generative probabilistic graphical network. The probability of finding the network in a certain configuration is given by the Boltzmann distribution. Given training data, its learning is done by optimizing the parameters of the energy function of the network. In this paper, we analyze the training process of the restricted Boltzmann machine in the context of statistical physics. As an illustration, for small-size Bar-and-Stripe patterns, we calculate thermodynamic quantities such as the entropy, free energy, and internal energy as a function of training epoch. We demonstrate the growth of the correlation between the visible and hidden layers via the subadditivity of entropies as the training proceeds. Using Monte-Carlo simulation of trajectories of the visible and hidden vectors in configuration space, we also calculate the distribution of the work done on the restricted Boltzmann machine by switching the parameters of the energy function. We discuss the Jarzynski equality, which connects the path average of the exponential function of the work and the difference in free energies before and after training.

I. INTRODUCTION

A restricted Boltzmann machine (RBM) [1] is a generative probabilistic neural network. RBMs and general Boltzmann machines are described by a probability distribution with parameters, i.e., the Boltzmann distribution. An RBM is an undirected Markov random field and is considered a basic building block of deep neural networks. RBMs have been applied widely, for example, to dimensionality reduction, classification, feature learning, pattern recognition, topic modeling, and so on [2–4].

As its name implies, an RBM is closely connected to physics, sharing important concepts such as entropy and free energy [5]. Recently, RBMs have attracted renewed attention in physics since Carleo and Troyer [6] showed that a quantum many-body state could be efficiently represented by an RBM. Gabrié et al. and Tramel et al. [7] employed the Thouless-Anderson-Palmer mean-field approximation, used for spin glass problems, to replace the Gibbs sampling of contrastive-divergence training. Amin et al. [8] proposed a quantum Boltzmann machine based on the quantum Boltzmann distribution of a quantum Hamiltonian. More interestingly, there is a deep connection between Boltzmann machines and the tensor networks of quantum many-body systems [9–13]. Xia and Kais combined the restricted Boltzmann machine and quantum algorithms to calculate the electronic energies of small molecules [14].

While the working principles of RBMs have been well established, a better understanding of the RBM is still needed for further applications. In this paper, we investigate the RBM from the perspective of statistical physics. As an illustration, for bar-and-stripe pattern data, thermodynamic quantities such as the entropy, the internal energy, the free energy, and

* [email protected]
† [email protected]
‡ [email protected]

the work are calculated as a function of epoch. Since the RBM is a bipartite system composed of visible and hidden layers, it may be interesting, and informative, to see how the correlation between the two layers grows as the training goes on. We show that the total entropy of the RBM is always less than the sum of the entropies of the visible and hidden layers, except at the initial time when the training begins. This is the so-called subadditivity of entropy, indicating that the visible layer becomes correlated with the hidden layer as the training proceeds. Adjusting the parameters of the energy function during training can be considered, from a thermodynamic point of view, as work done on the RBM. Using Monte-Carlo simulation of the trajectories of the visible and hidden vectors in configuration space, we calculate the work of a single trajectory and the statistics of the work over an ensemble of trajectories. We also examine the Jarzynski equality, which connects the ensemble average of the work done on the RBM and the difference in free energies before and after training of the RBM.

The paper is organized as follows. In Section II, a detailed analysis of the RBM from the statistical physics point of view is presented. In Section III, we present a summary of the results together with discussions.

II. STATISTICAL PHYSICS OF RESTRICTED BOLTZMANN MACHINES

A. Restricted Boltzmann machines

Let us start with a brief introduction to the RBM [1–3]. As shown in Fig. 1, the RBM is composed of two layers: the visible layer and the hidden layer. Possible configurations of the visible and hidden layers are represented by the random binary vectors v = (v_1, ..., v_N) ∈ {0, 1}^N and h = (h_1, ..., h_M) ∈ {0, 1}^M, respectively. The interaction between the visible and hidden layers is given by the so-called weight matrix w ∈ R^{N×M}, where the weight w_ij is the connection strength between a visible unit v_i and a hidden unit



FIG. 1: Graph structure of a restricted Boltzmann machine with the visible layer and the hidden layer. The visible units v_1, ..., v_N carry biases b_1, ..., b_N, the hidden units h_1, ..., h_M carry biases c_1, ..., c_M, and the two layers are connected by the weights w_ij.

h_j. The biases b_i ∈ R and c_j ∈ R are applied to visible unit i and hidden unit j, respectively. Given random vectors v and h, the energy function of the RBM is written as an Ising-type Hamiltonian

\[ E(v, h; \theta) = -\sum_{i=1}^{N} \sum_{j=1}^{M} w_{ij} v_i h_j - \sum_{i=1}^{N} b_i v_i - \sum_{j=1}^{M} c_j h_j \,, \tag{1} \]

where the set of model parameters is denoted by θ ≡ {w_ij, b_i, c_j}. The joint probability of finding v and h of the RBM is given by the Boltzmann distribution

\[ p(v, h; \theta) = \frac{e^{-E(v,h;\theta)}}{Z(\theta)} \,, \tag{2} \]

where the partition function, Z(θ) ≡ ∑_{v,h} e^{−E(v,h;θ)}, is the sum over all possible configurations. The marginal probabilities p(v; θ) and p(h; θ) for the visible and hidden layers are obtained by summing over the hidden or visible variables, respectively,

\[ p(v; \theta) = \sum_{h} p(v, h; \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v,h;\theta)} \,, \tag{3a} \]
\[ p(h; \theta) = \sum_{v} p(v, h; \theta) = \frac{1}{Z(\theta)} \sum_{v} e^{-E(v,h;\theta)} \,. \tag{3b} \]
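For a system of this size, all of these quantities can be evaluated by brute force. The following minimal sketch (illustrative, not the authors' code) enumerates every configuration of a small RBM with randomly initialized parameters and computes the partition function, the joint distribution of Eq. (2), and the marginals of Eqs. (3a) and (3b); the sizes N = 4 and M = 6 match the benchmark used below.

```python
# A minimal sketch: exact enumeration of the Boltzmann distribution of
# Eqs. (1)-(3) for a small RBM. The random parameters theta = (w, b, c)
# are illustrative placeholders, not trained values.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 6
w = rng.normal(scale=0.1, size=(N, M))   # weights w_ij
b = rng.normal(scale=0.1, size=N)        # visible biases b_i
c = rng.normal(scale=0.1, size=M)        # hidden biases c_j

def energy(v, h):
    """Ising-type energy E(v, h; theta) of Eq. (1)."""
    return -(v @ w @ h + b @ v + c @ h)

# Enumerate all 2^(N+M) configurations to get Z and the joint p(v, h).
vs = np.array(list(itertools.product([0, 1], repeat=N)))   # 16 visible vectors
hs = np.array(list(itertools.product([0, 1], repeat=M)))   # 64 hidden vectors
E = np.array([[energy(v, h) for h in hs] for v in vs])     # 16 x 64 energy table
Z = np.exp(-E).sum()                                       # partition function
p_joint = np.exp(-E) / Z                                   # Eq. (2)
p_v = p_joint.sum(axis=1)                                  # marginal, Eq. (3a)
p_h = p_joint.sum(axis=0)                                  # marginal, Eq. (3b)
print(Z, p_v.sum(), p_h.sum())   # both marginals sum to 1
```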

The training of the RBM is to adjust the model parameters θ such that the marginal probability of the visible layer, p(v; θ), becomes as close as possible to the unknown probability p_data(v) that generates the training data. Given independently and identically sampled training data D = {v^(1), ..., v^(D)}, the optimal model parameters θ can be obtained by maximizing the likelihood function of the parameters, L(θ|D) = ∏_{i=1}^{D} p(v^(i); θ), or equivalently by maximizing the log-likelihood function ln L(θ|D) = ∑_{i=1}^{D} ln p(v^(i); θ). Maximizing the likelihood function is equivalent to minimizing the Kullback-Leibler divergence, or relative entropy, of p(v; θ) from q(v) [15, 16],

\[ D_{\mathrm{KL}}(q \,\|\, p) = \sum_{v} q(v) \ln \frac{q(v)}{p(v; \theta)} \,, \tag{4} \]

where q(v) is an unknown probability that generates the training data. Another method of monitoring the progress of training is the cross-entropy cost between the input visible vector v^(i) and the reconstructed visible vector ṽ^(i) of the RBM,

\[ C = -\frac{1}{D} \sum_{i \in \mathcal{D}} \left[ v^{(i)} \ln \tilde{v}^{(i)} + (1 - v^{(i)}) \ln (1 - \tilde{v}^{(i)}) \right] . \tag{5} \]
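Both monitoring quantities are direct sums over configurations once the model marginal is known. A minimal sketch, assuming the array p_v from the enumeration sketch above and taking q(v) to be uniform over the six target patterns introduced below; the helper names are illustrative:

```python
# A sketch of the two monitoring quantities, Eqs. (4) and (5).
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) of Eq. (4); terms with q(v) = 0 contribute zero."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps)))

def reconstruction_cross_entropy(v, v_rec, eps=1e-12):
    """Cross-entropy cost of Eq. (5) between data v and reconstructions v_rec."""
    v_rec = np.clip(v_rec, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(np.sum(v * np.log(v_rec) + (1 - v) * np.log(1 - v_rec), axis=1))

q = np.zeros(16)
q[[0, 3, 5, 10, 12, 15]] = 1.0 / 6.0   # uniform target over the six patterns
# Example: kl = kl_divergence(q, p_v) with p_v from the enumeration sketch
```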

The stochastic gradient ascent method for the log-likelihood function is used to train the RBM. Estimating the gradient of the log-likelihood requires Monte-Carlo sampling from the model probability distribution. Well-known sampling methods are contrastive divergence, denoted by CD-k, and persistent contrastive divergence, PCD-k. For the details of the RBM algorithm, see Refs. [2–4]. Here we employ the CD-k method.
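The CD-k update itself is standard [2–4]. The following is a minimal sketch of one update step, assuming sigmoid conditional probabilities for binary units; the function names, batch layout, and default hyperparameters are illustrative, not taken from the paper:

```python
# A sketch of one CD-k parameter update for an RBM with binary units.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, w, c):
    p = sigmoid(v @ w + c)                  # p(h_j = 1 | v)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, w, b):
    p = sigmoid(h @ w.T + b)                # p(v_i = 1 | h)
    return (rng.random(p.shape) < p).astype(float), p

def cd_k_update(v_data, w, b, c, k=1, lr=0.15):
    """One contrastive-divergence step on a batch of visible vectors."""
    _, ph_data = sample_h_given_v(v_data, w, c)   # positive phase
    v = v_data.astype(float)
    for _ in range(k):                            # k steps of Gibbs sampling
        h, _ = sample_h_given_v(v, w, c)
        v, _ = sample_v_given_h(h, w, b)
    _, ph_model = sample_h_given_v(v, w, c)       # negative phase
    w += lr * (v_data.T @ ph_data - v.T @ ph_model) / len(v_data)
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return w, b, c
```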

B. Free energy, entropy, and internal energy

From a physics point of view, the RBM is a finite classical system composed of two subsystems, similar to an Ising spin system. The training of the RBM can be considered as driving the system from an initial equilibrium state to the target equilibrium state by switching the model parameters. It may be interesting to see how thermodynamic quantities such as the free energy, entropy, internal energy, and work change as the training progresses.

It is straightforward to write down various thermodynamic quantities for the total system. The free energy F is given by the logarithm of the partition function Z,

\[ F(\theta) = -\ln Z(\theta) \,. \tag{6} \]

The internal energy U is given by the expectation value of the energy function E(v, h; θ),

\[ U(\theta) = \sum_{v,h} E(v, h; \theta)\, p(v, h; \theta) \,. \tag{7} \]

The entropy S of the total system comprising the hidden and visible layers is given by

\[ S(\theta) = -\sum_{v,h} p(v, h; \theta) \ln p(v, h; \theta) \,. \tag{8} \]

Here, the convention 0 ln 0 = 0 is employed if p(v, h) = 0. The free energy (6) is related to the difference between the internal energy (7) and the entropy (8),

\[ F = U - TS \,, \tag{9} \]

where the temperature T is set to 1.

Generally, it is very challenging to calculate the thermodynamic quantities, even numerically.
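For the small benchmark considered below, however, exact enumeration is feasible and Eqs. (6)–(9) reduce to a few array operations. A minimal sketch, assuming the energy table E, partition function Z, and joint distribution p_joint from the enumeration sketch above:

```python
# A sketch of Eqs. (6)-(9) by exact enumeration.
import numpy as np

def thermodynamics(E, Z, p_joint):
    """Free energy, internal energy, and entropy of the total system."""
    F = -np.log(Z)                                       # Eq. (6)
    U = np.sum(E * p_joint)                              # Eq. (7)
    mask = p_joint > 0                                   # convention 0 ln 0 = 0
    S = -np.sum(p_joint[mask] * np.log(p_joint[mask]))   # Eq. (8)
    assert np.isclose(F, U - S)                          # Eq. (9) with T = 1
    return F, U, S
```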


FIG. 2: The 6 samples of 2 × 2 Bar-and-Stripe patterns used as the training data in this work. Each configuration is represented by a visible vector v ∈ {0, 1}^{2×2} or by a decimal number in row-major ordering: (0, 0, 0, 0) = 0, (0, 0, 1, 1) = 3, (0, 1, 0, 1) = 5, (1, 0, 1, 0) = 10, (1, 1, 0, 0) = 12, (1, 1, 1, 1) = 15.

The number of possible configurations of N visible units and M hidden units grows exponentially as 2^{N+M}. Here, for a feasible benchmark test, the 2 × 2 Bar-and-Stripe data are considered [17, 18]. Fig. 2 shows the 6 possible 2 × 2 Bar-and-Stripe patterns out of 16 possible configurations, which are used as the training data in this work. We take the sizes of the visible and the hidden layers as N = 4 and M = 6, respectively. In order to understand better how the RBM is trained, the thermodynamic quantities are calculated numerically for this small benchmark system.
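The six patterns can be constructed programmatically. A small sketch of the construction (illustrative, not code from the paper), flattening each 2 × 2 pattern in row-major order:

```python
# A sketch of the 2x2 Bar-and-Stripe training set of Fig. 2: all-zero,
# all-one, the two row ("bar") and the two column ("stripe") patterns.
import itertools
import numpy as np

patterns = []
for bits in itertools.product([0, 1], repeat=2):
    rows = np.repeat(np.array(bits)[:, None], 2, axis=1)   # bar patterns
    cols = np.repeat(np.array(bits)[None, :], 2, axis=0)   # stripe patterns
    patterns += [rows.flatten(), cols.flatten()]
data = np.unique(np.array(patterns), axis=0)               # 6 distinct vectors
decimals = data @ (1 << np.arange(3, -1, -1))              # row-major decimal labels
print(decimals)                                            # [ 0  3  5 10 12 15]
```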

FIG. 3: (a) The bias b_i on visible unit i and the bias c_j on hidden unit j, plotted as a function of epoch. (b) The weights w_ij connecting visible unit i and hidden unit j, plotted as a function of epoch.

Fig. 3 shows how the weights w_ij, the biases b_i on the visible units, and the biases c_j on the hidden units change as the training goes on. The weights w_ij are clustered into 3 classes. The evolution of the biases b_i on the visible layer is somewhat different from that of the biases c_j on the hidden layer: the changes in c_j are larger than those in b_i. Fig. 4 shows the change in the marginal probabilities p(v) of the visible layer and p(h) of the hidden layer before and after training. Note that the marginal probability p(v) after training is not distributed exclusively over the 6 possible outcomes of the training data set in Fig. 2.

FIG. 4: Marginal probabilities p(v) of the visible layer and p(h) of the hidden layer, plotted (a) before training and (b) after training. The binary vector v or h on the x-axis is represented by the decimal number as noted in the caption of Fig. 2. The visible and the hidden layers have total numbers of configurations given by 2^4 = 16 and 2^6 = 64, respectively. The learning rate is 0.15, the training epoch 20000, and CD-k 100.

Typically, the progress of learning of the RBM is monitored by the loss function. Here the Kullback-Leibler divergence, Eq. (4), and the reconstructed cross entropy, Eq. (5), are used. Fig. 5 plots the reconstructed cross entropy C, the Kullback-Leibler divergence D_KL, the entropy S, the free energy F, and the internal energy U as a function of epoch. As shown in Fig. 5(a), it is interesting to see that even after a large number of epochs, ∼10,000, the cost function C continues approaching zero while the entropy S and the Kullback-Leibler divergence D_KL become steady.


On the other hand, the free energy F continues decreasing together with the internal energy U, as depicted in Fig. 5(b). The Kullback-Leibler divergence is a well-known indicator of the performance of RBMs. Our result thus implies that the entropy may be another good indicator for monitoring the progress of the RBM, while the other thermodynamic quantities may not be.

FIG. 5: For the 2 × 2 bar-and-stripe data, (a) the cost function C, the entropy S, and the Kullback-Leibler divergence D_KL(q || p) are plotted as a function of epoch. (b) The free energy F, the entropy S, and the internal energy U of the RBM are calculated as a function of epoch; the combination F − U + S is also shown.

In addition to the thermodynamic quantities of the total system of the RBM, Eqs. (6), (7), and (8), it is interesting to see how the two subsystems of the RBM evolve. Since the RBM has no intra-layer connections, the correlation between the visible layer and the hidden layer may increase as the training proceeds. This correlation can be measured by the difference between the total entropy and the sum of the entropies of the two subsystems. The entropies of the visible and hidden layers are given by

\[ S_V = -\sum_{v} p(v; \theta) \ln p(v; \theta) \,, \tag{10a} \]
\[ S_H = -\sum_{h} p(h; \theta) \ln p(h; \theta) \,. \tag{10b} \]

The entropy S_V of the visible layer is closely related to the Kullback-Leibler divergence of p(v; θ) from the unknown probability q(v) which produces the data. Eq. (4) is expanded as

\[ D_{\mathrm{KL}}(q \,\|\, p) = \sum_{v} q(v) \ln q(v) - \sum_{v} q(v) \ln p(v; \theta) \,. \tag{11} \]

The second term, −∑_v q(v) ln p(v; θ), depends on the parameters θ. As the training proceeds, p(v; θ) becomes close to q(v), so the behavior of the second term is very similar to that of the entropy S_V of the visible layer. If the training is perfect, we have q(v) = p(v; θ), which leads to D_KL(q || p) = 0 while S_V remains nonzero.

FIG. 6: (a) The Kullback-Leibler divergence D_KL(q || p), the entropy S_V, and their difference D_KL − S_V, plotted as a function of epoch. (b) The entropy S of the total system, the entropy S_V of the visible layer, the entropy S_H of the hidden layer, and the difference S − S_V − S_H, plotted as a function of epoch.

The difference between the total entropy and the sum of the entropies of the subsystems is written as

\[ S - (S_V + S_H) = \sum_{v,h} p(v, h) \ln\!\left[ \frac{p(v)\, p(h)}{p(v, h)} \right] . \tag{12} \]

Eq. (12) tells us that if the visible random vector v and the hidden random vector h are independent, i.e., p(v, h; θ) = p(v; θ) p(h; θ), then the entropy S of the total system is the sum of the entropies of the subsystems. In general, the entropy S of the total system is always less than or equal to the sum of the entropy of the visible layer, S_V, and the entropy of the hidden layer, S_H [19],

\[ S \le S_V + S_H \,. \tag{13} \]

This is called the subadditivity of entropy, one of the basic properties of the Shannon entropy, which is also valid for the von Neumann entropy [20, 21]. This property can be proved using the log inequality −ln x ≥ −x + 1. Alternatively, Eq. (13) may be proved using the log-sum inequality, which states that for two sets of nonnegative numbers, a_1, ..., a_n and b_1, ..., b_n,

\[ \sum_{i} a_i \log \frac{a_i}{b_i} \;\ge\; \left( \sum_{i} a_i \right) \log \frac{\sum_i a_i}{\sum_i b_i} \,. \tag{14} \]

In other words, Eq. (12) can be regarded as the negative of the relative entropy, or Kullback-Leibler divergence, of the joint probability p(v, h) from the product probability p(v) p(h),

\[ I\big(p(v, h) \,\|\, p(v)\, p(h)\big) = \sum_{v,h} p(v, h) \log\!\left[ \frac{p(v, h)}{p(v)\, p(h)} \right] . \tag{15} \]
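Eqs. (12)–(15) can be verified numerically by exact enumeration. A minimal sketch, assuming the joint distribution p_joint (a 2^N × 2^M array) from the enumeration sketch above:

```python
# A sketch checking the subadditivity relation: S - (S_V + S_H) = -I(V;H) <= 0.
import numpy as np

def entropy(p):
    """Shannon entropy with the convention 0 ln 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def subadditivity_check(p_joint):
    p_v = p_joint.sum(axis=1)            # marginal of the visible layer
    p_h = p_joint.sum(axis=0)            # marginal of the hidden layer
    S, S_V, S_H = entropy(p_joint.ravel()), entropy(p_v), entropy(p_h)
    mask = p_joint > 0
    # mutual information, Eq. (15)
    I = np.sum(p_joint[mask] * np.log(p_joint[mask] / np.outer(p_v, p_h)[mask]))
    assert np.isclose(S - (S_V + S_H), -I)       # Eq. (12)
    assert S <= S_V + S_H + 1e-12                # Eq. (13)
    return S, S_V, S_H, I
```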

For the 2 × 2 Bar-and-Stripe pattern, the entropies of the visible and hidden layers, S_V and S_H, are calculated numerically. Fig. 6 plots the entropies S_V, S_H, and S, and the Kullback-Leibler divergence D_KL(q || p) as a function of epoch. Fig. 6(a) shows that the Kullback-Leibler divergence D_KL(q || p) becomes saturated, though above zero, as the training proceeds. Similarly, the entropy S_V of the visible layer saturates. This implies that the entropy of the visible layer, as well as the total entropy shown in Fig. 5, can be a better indicator of learning than the reconstructed cross entropy C, Eq. (5). The same can also be said about the entropy of the hidden layer, S_H.

The difference between the total entropy and the sum of the entropies of the two subsystems, S − (S_V + S_H), becomes less than 0, as shown in Fig. 6(b). This demonstrates the subadditivity of entropy, i.e., the correlation between the visible and the hidden layers grows as the training proceeds. As it saturates after a large number of epochs, just as the total entropy and the entropies of the visible and hidden layers do, the correlation between the visible layer and the hidden layer can also be a good quantifier of the RBM's training progress.

C. Work, free energy, and Jarzynski equality

The training of the RBM may be viewed as driving a finite classical spin system from an initial equilibrium state to a final equilibrium state by changing the system parameters θ slowly. If the parameters θ are switched infinitely slowly, the classical system remains in quasi-static equilibrium. In this case, the total work done on the system is equal to the Helmholtz free energy difference between before training and after training, W_∞ = F_1 − F_0. For switching θ at a finite rate, the system may not evolve immediately to an equilibrium state, so the work done on the system depends on the specific path of the system in configuration space. Jarzynski [22, 23] proved that for any switching rate the free energy difference ΔF is related to the average of the exponential function of the work W over the paths,

\[ \langle e^{-W} \rangle_{\mathrm{path}} = e^{-\Delta F} \,. \tag{16} \]

The RBM is trained by changing the parameters θ through a sequence {θ_0, θ_1, ..., θ_τ}, as shown in Fig. 3. To calculate the work done during the training, we perform a Monte-Carlo simulation of the trajectory of a state (v, h) of the RBM in configuration space. From the initial configuration (v_0, h_0), sampled from the initial Boltzmann distribution, Eq. (2), the trajectory (v_0, h_0) → (v_1, h_1) → · · · → (v_τ, h_τ) is obtained using the Metropolis-Hastings algorithm of the Markov chain Monte-Carlo method [24, 25]. Assuming the evolution is Markovian, the probability of taking a specific trajectory is the product of the transition probabilities at each step,

\[ p(v_0, h_0 \xrightarrow{\theta_1} v_1, h_1)\; p(v_1, h_1 \xrightarrow{\theta_2} v_2, h_2) \cdots p(v_{\tau-1}, h_{\tau-1} \xrightarrow{\theta_\tau} v_\tau, h_\tau) \,. \tag{17} \]

The transition (v, h) → (v′, h′) can be implemented by the Metropolis-Hastings algorithm based on the detailed balance condition for a fixed parameter θ,

\[ \frac{p(v, h \xrightarrow{\theta} v', h')}{p(v', h' \xrightarrow{\theta} v, h)} = \frac{e^{-E(v',h';\theta)}}{e^{-E(v,h;\theta)}} \,. \tag{18} \]
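A transition obeying Eq. (18) can be sketched as follows; the single-flip proposal and the uniform choice of which unit to flip are implementation assumptions, not prescriptions from the paper:

```python
# A sketch of one single-flip Metropolis-Hastings step at fixed parameters theta.
import numpy as np

rng = np.random.default_rng(2)

def energy(v, h, w, b, c):
    """Ising-type energy E(v, h; theta) of Eq. (1)."""
    return -(v @ w @ h + b @ v + c @ h)

def mh_step(v, h, w, b, c):
    v_new, h_new = v.copy(), h.copy()
    i = rng.integers(len(v) + len(h))      # pick one unit uniformly at random
    if i < len(v):
        v_new[i] = 1 - v_new[i]            # flip a visible unit
    else:
        h_new[i - len(v)] = 1 - h_new[i - len(v)]   # flip a hidden unit
    dE = energy(v_new, h_new, w, b, c) - energy(v, h, w, b, c)
    if rng.random() < np.exp(-dE):         # accept with probability min(1, e^{-dE})
        return v_new, h_new
    return v, h                            # reject: keep the old configuration
```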

The work done on the RBM at epoch i may be given by

\[ \delta W_i = E(v_i, h_i; \theta_{i+1}) - E(v_i, h_i; \theta_i) \,. \tag{19} \]

The total work W = ∑_i δW_i performed on the system is written as [26]

\[ W = \sum_{i=0}^{\tau-1} \left[ E(v_i, h_i; \theta_{i+1}) - E(v_i, h_i; \theta_i) \right] . \tag{20} \]
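Accumulating Eq. (20) along a simulated trajectory then takes the following form. This is a sketch under stated assumptions: theta_sequence is a hypothetical list of (w, b, c) parameter snapshots saved during training, step is a fixed-parameter transition such as mh_step above, and the number of relaxation steps per parameter switch, n_relax, is illustrative:

```python
# A sketch of the work accumulation of Eqs. (19)-(20) along one trajectory.
import numpy as np

def energy(v, h, w, b, c):
    return -(v @ w @ h + b @ v + c @ h)

def trajectory_work(v, h, theta_sequence, step, n_relax=10):
    """Total work W = sum_i [E(v_i, h_i; theta_{i+1}) - E(v_i, h_i; theta_i)]."""
    W = 0.0
    for (w0, b0, c0), (w1, b1, c1) in zip(theta_sequence, theta_sequence[1:]):
        # work of the parameter switch theta_i -> theta_{i+1}, Eq. (19)
        W += energy(v, h, w1, b1, c1) - energy(v, h, w0, b0, c0)
        for _ in range(n_relax):           # relax at the new parameters
            v, h = step(v, h, w1, b1, c1)
    return W
```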

FIG. 7: Heat map of the energy function E(v, h; θ), representing the energy level of each configuration, after training on the 2 × 2 Bar-and-Stripe patterns for 50000 epochs. The sizes of the visible and hidden layers are N = 4 and M = 6, respectively. The learning rate is r = 0.15 and the value of CD-k is k = 100. The vertical and horizontal axes represent each configuration of the visible and hidden layers, respectively. The black tiles represent the lowest-energy configurations among all configurations; the probability of finding those configurations is thus high.

Given the sequence of the model parameters {θ_0, θ_1, ..., θ_τ}, the Markov evolution of the visible and hidden vectors (v, h) ∈ {0, 1}^{N+M} may be considered a discrete random walk. Random walkers move to points of low energy in configuration space. Fig. 7 shows the heat map of the energy function E(v, h; θ) of the RBM for the 2 × 2 Bar-and-Stripe patterns after training. One can see that the energy function has deep levels at the visible vectors corresponding to the Bar-and-Stripe patterns of the training data set in Fig. 2, indicating that the probability of generating the trained patterns is high. Also note that the energy function has many local minima. Fig. 8 plots a few Monte-Carlo trajectories of the visible vector v as a function of epoch. Before training, the visible vector v is distributed over all possible configurations, represented by the decimal numbers 0, ..., 15.


FIG. 8: Markov chain Monte-Carlo trajectories of the visible vector v, plotted as a function of epoch. The visible vector jumps frequently in the early stage of training and becomes trapped in one of the target states as the training proceeds.

FIG. 9: Gaussian distribution of the work done on the RBM during the training, at epoch 5000 with 50000 Monte-Carlo samplings (mean µ = −5.481, standard deviation σ = 3.358). The red curve is the Gaussian distribution with the mean and standard deviation calculated from the Monte-Carlo simulation.

As the training progresses, the visible vector v becomes trapped in one of the six target configurations (0, 3, 5, 10, 12, 15).

In order to examine the relation between the work done on the RBM during the training and the free energy difference, a Monte-Carlo simulation is performed to calculate the average of the work over paths generated by the Metropolis-Hastings algorithm of the Markov chain Monte-Carlo method. Each path starts from an initial state sampled from the uniform distribution over configuration space, as shown in Fig. 4(a).

FIG. 10: Average of the work done, with its standard deviation, and the free energy difference ΔF = F(epoch) − F(epoch = 0) as a function of epoch. The error bars on the work represent the standard deviation of the Gaussian distribution.

Since the work done on the system depends on the path, the distribution of the work is calculated by generating many trajectories. Fig. 9 shows the distribution of the work over 50000 paths at training epoch 5000. The Monte-Carlo average of the work is ⟨W⟩ ≈ −5.481, and its standard deviation is σ_W ≈ 3.358. The distribution of the work generated by the Monte-Carlo simulation is well fitted by a Gaussian distribution, as depicted by the red curve in Fig. 9. This agrees with the statement in Ref. [23] that for slow switching of the model parameters the probability distribution of the work is approximately Gaussian.

We perform the Monte-Carlo calculation of the exponential average of the work, ⟨e^{−W}⟩_path, to check the Jarzynski equality, Eq. (16). The free energy difference can be estimated as

\[ e^{-\Delta F} = \langle e^{-W} \rangle_{\mathrm{path}} \approx \frac{1}{N_{\mathrm{mc}}} \sum_{n=1}^{N_{\mathrm{mc}}} e^{-W_n} \,, \tag{21} \]

where N_mc is the number of Monte-Carlo samplings. At small epoch numbers, the Monte-Carlo estimate of the free energy difference is close to ΔF calculated from the partition function. However, this Monte-Carlo calculation gives a poor estimate of the free energy difference if the epoch is greater than 5000. This numerical error can be explained by the fact that the exponential average of the work is dominated by rare realizations [27–31]. As shown in Fig. 9, the distribution of the work is given by a Gaussian distribution ρ(W) with mean ⟨W⟩ and standard deviation σ_W. As the standard deviation σ_W becomes larger, the peak position of ρ(W)e^{−W} moves into the long tail of the Gaussian distribution, so the main contribution to the integral in ⟨e^{−W}⟩ comes from rare realizations. Fig. 10 shows that the standard deviation σ_W grows with epoch, so the error of the Monte-Carlo estimate of the exponential average of the work grows quickly.
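Numerically, the exponential average in Eq. (21) also overflows easily for strongly negative work values, so a log-sum-exp evaluation is convenient. A minimal sketch, assuming works is an array of per-trajectory work values W_n such as those produced by the trajectory sketch above:

```python
# A sketch of the Jarzynski estimator of Eq. (21), evaluated stably.
import numpy as np

def jarzynski_delta_F(works):
    """Estimate Delta F = -ln <e^{-W}>_path from sampled work values.

    The max-shift (log-sum-exp) avoids overflow in exp(-W).
    """
    x = -np.asarray(works, dtype=float)
    m = x.max()
    return -(m + np.log(np.mean(np.exp(x - m))))
```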

If σ_W^2 ≪ k_B T, the free energy is related to the average of the work and its variance as

\[ \Delta F = \langle W \rangle_{\mathrm{path}} - \frac{\sigma_W^2}{2 k_B T} \,. \tag{22} \]

Here, the case is the opposite: the spread of the work values is large, i.e., σ_W^2 ≫ k_B T (= 1), so the central limit theorem does not apply and the above equation cannot be used [32]. Fig. 10 shows how the average of the work, ⟨W⟩_path, over the Markov chain Monte-Carlo paths changes as a function of epoch. The standard deviation of the Gaussian distribution of the work also grows as a function of training epoch. The free energy difference between before training and after training is called the reversible work, W_r = ΔF. The difference between the actual work and the reversible work is called the dissipative work, W_d = W − W_r [26]. As depicted in Fig. 10, the magnitude of the dissipative work grows with training epoch.
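For completeness, a sketch of the near-equilibrium estimate of Eq. (22), with k_B T = 1 as in the paper; works is again an array of sampled work values:

```python
# A sketch of the near-equilibrium (Gaussian-work) estimate of Eq. (22),
# valid only when sigma_W^2 << k_B T.
import numpy as np

def gaussian_delta_F(works, kT=1.0):
    works = np.asarray(works, dtype=float)
    return works.mean() - works.var() / (2.0 * kT)
```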

III. SUMMARY

In summary, we analyzed the training process of the RBM in the context of statistical physics. In addition to the typical loss function, i.e., the reconstructed cross entropy, thermodynamic quantities such as the free energy F, the internal energy U, and the entropy S were calculated as a function of epoch. While the free energy and the internal energy decrease rather indefinitely with epochs, the total entropy and the entropies of the visible and hidden layers become saturated, together with the Kullback-Leibler divergence, after a sufficient number of epochs. This result suggests that the entropy of the system may be a good indicator of the RBM's training progress, along with the Kullback-Leibler divergence. It seems worth investigating the entropy for other, larger data sets, for example, the MNIST handwritten digits [33], in future work.

We have further demonstrated the subadditivity of the entropy, i.e., the entropy of the total system is less than the sum of the entropies of the two layers. This manifests the correlation between the visible and hidden layers growing as the training progresses. Just as the entropies saturate together with the Kullback-Leibler divergence, so does the correlation determined by the total and the local entropies. In this sense, the correlation between the visible and the hidden layers may be another good indicator of the RBM's performance.

We also investigated the work done on the RBM by switching the parameters of the energy function. The trajectories of the visible and hidden vectors in configuration space were generated using Markov chain Monte-Carlo simulation. The distribution of the work follows a Gaussian distribution, and its standard deviation grows with training epochs. We discussed the Jarzynski equality, which connects the free energy difference and the average of the exponential function of the work over the trajectories.

A more detailed analysis from a full thermodynamic or statistical-physics point of view can bring useful insights into the performance of RBMs. This course of study may, in the long run, enable us to come up with methods for better performance of RBMs in many different applications. It may therefore be worthwhile to pursue our study further, e.g., with a rigorous assessment of the scaling behavior of the thermodynamic quantities with respect to epochs as the sizes of the visible and hidden layers increase. We also expect that a similar analysis of a quantum Boltzmann machine would be valuable.

[1] P. Smolensky, "Information processing in dynamical systems: foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, edited by D. Rumelhart and J. McLelland (MIT Press, Cambridge, 1986) pp. 194–281.

[2] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," in Neural Networks: Tricks of the Trade: Second Edition, edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012) pp. 599–619.

[3] A. Fischer and C. Igel, "Training restricted Boltzmann machines: An introduction," Pattern Recognition 47, 25–39 (2014).

[4] J. Melchior, A. Fischer, and L. Wiskott, "How to center deep Boltzmann machines," Journal of Machine Learning Research 17, 1–61 (2016).

[5] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, "A high-bias, low-variance introduction to machine learning for physicists," Physics Reports 810, 1–124 (2019).

[6] G. Carleo and M. Troyer, "Solving the quantum many-body problem with artificial neural networks," Science 355, 602–606 (2017).

[7] E. W. Tramel, M. Gabrié, A. Manoel, F. Caltagirone, and F. Krzakala, "Deterministic and generalized framework for unsupervised learning with restricted Boltzmann machines," Phys. Rev. X 8, 041006 (2018).

[8] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, "Quantum Boltzmann machine," Phys. Rev. X 8, 021050 (2018).

[9] E. Stoudenmire and D. J. Schwab, "Supervised learning with tensor networks," in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016) pp. 4799–4807.

[10] X. Gao and L.-M. Duan, "Efficient representation of quantum many-body states with deep neural networks," Nature Communications 8, 662 (2017).

[11] J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, "Equivalence of restricted Boltzmann machines and tensor network states," Phys. Rev. B 97, 085104 (2018).

[12] S. Das Sarma, D.-L. Deng, and L.-M. Duan, "Machine learning meets quantum physics," Physics Today 72, 48–54 (2019).

[13] W. Huggins, P. Patil, B. Mitchell, K. B. Whaley, and E. M. Stoudenmire, "Towards quantum machine learning with tensor networks," Quantum Science and Technology 4, 024001 (2019).

[14] R. Xia and S. Kais, "Quantum machine learning for electronic structure calculations," Nature Communications 9, 4195 (2018).

[15] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist. 22, 79–86 (1951).

[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley, New York, 2006).

[17] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, edited by D. E. Rumelhart and J. L. McLelland (MIT Press, Cambridge, 1986) pp. 282–317.

[18] D. J. C. MacKay, Information Theory, Inference & Learning Algorithms (Cambridge University Press, New York, 2002).

[19] F. Reif, Fundamentals of Statistical and Thermal Physics (McGraw-Hill, New York, 1965).

[20] H. Araki and E. H. Lieb, "Entropy inequalities," Communications in Mathematical Physics 18, 160–170 (1970).

[21] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2000).

[22] C. Jarzynski, "Nonequilibrium equality for free energy differences," Phys. Rev. Lett. 78, 2690–2693 (1997).

[23] C. Jarzynski, "Equalities and inequalities: Irreversibility and the second law of thermodynamics at the nanoscale," Annual Review of Condensed Matter Physics 2, 329–351 (2011).

[24] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," The Journal of Chemical Physics 21, 1087–1092 (1953).

[25] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika 57, 97–109 (1970).

[26] G. E. Crooks, "Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems," Journal of Statistical Physics 90, 1481–1487 (1998).

[27] C. Jarzynski, "Rare events and the convergence of exponentially averaged work values," Phys. Rev. E 73, 046105 (2006).

[28] D. M. Zuckerman and T. B. Woolf, "Theory of a systematic computational error in free energy differences," Phys. Rev. Lett. 89, 180602 (2002).

[29] W. Lechner, H. Oberhofer, C. Dellago, and P. L. Geissler, "Equilibrium free energies from fast-switching trajectories with large time steps," The Journal of Chemical Physics 124, 044113 (2006).

[30] W. Lechner and C. Dellago, "On the efficiency of path sampling methods for the calculation of free energies from non-equilibrium simulations," Journal of Statistical Mechanics: Theory and Experiment 2007, P04001 (2007).

[31] Y. N. Halpern and C. Jarzynski, "Number of trials required to estimate a free-energy difference, using fluctuation relations," Phys. Rev. E 93, 052144 (2016).

[32] D. A. Hendrix and C. Jarzynski, "A fast growth method of computing free energy differences," The Journal of Chemical Physics 114, 5974–5981 (2001).

[33] Y. LeCun, C. Cortes, and C. J. Burges, "MNIST handwritten digit database," ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist (2010).