
Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Neural Networks and Learning Machines (Haykin)
2019 Lecture Notes on Self-learning Neural Algorithms

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University

Version: 20170926 → 20170928 → 20171011 → 20191001


Contents

11.1  Introduction
11.2  Statistical Mechanics
11.3  Markov Chains
11.4  Metropolis Algorithm
11.5  Simulated Annealing
11.6  Gibbs Sampling
11.7  Boltzmann Machine
11.8  Logistic Belief Nets
11.9  Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion


11.1 Introduction


• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems

• Statistical mechanics
  – The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  – The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  – The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
  – The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.

• Statistical mechanics for the study of neural networks
  – Cragg and Temperley (1954) and Cowan (1968)
  – Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)


11.2 Statistical Mechanics (1/2)


p_i : probability of occurrence of state i of a stochastic system,
      with p_i \ge 0 for all i and \sum_i p_i = 1
E_i : energy of the system when it is in state i

In thermal equilibrium, the probability of state i is given by the canonical (Gibbs) distribution:

    p_i = \frac{1}{Z} \exp\left(-\frac{E_i}{k_B T}\right),
    \qquad Z = \sum_i \exp\left(-\frac{E_i}{k_B T}\right)

\exp(-E / k_B T) : Boltzmann factor
Z : sum over states (partition function)

We set k_B = 1 and view -\log p_i as "energy".

1. States of low energy have a higher probability of occurrence than the states of high energy.

2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
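To make these two points concrete, here is a small numerical sketch (mine, not part of the lecture notes; it assumes NumPy) that evaluates the Gibbs distribution for a handful of energy levels at several temperatures:

```python
import numpy as np

def gibbs_distribution(energies, T, k_B=1.0):
    """Canonical distribution p_i = exp(-E_i / (k_B T)) / Z."""
    logits = -np.asarray(energies, dtype=float) / (k_B * T)
    logits -= logits.max()          # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()              # normalizing by the partition function Z

energies = [0.0, 1.0, 2.0, 5.0]
for T in (10.0, 1.0, 0.1):
    print(T, gibbs_distribution(energies, T).round(4))
# As T is reduced, the probability mass concentrates on the low-energy states.
```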


11.2 Statistical Mechanics (2/2)


Helmholtz free energy:
    F = -T \log Z

Average energy:
    \langle E \rangle = \sum_i p_i E_i

    \langle E \rangle - F = -T \sum_i p_i \log p_i

Entropy:
    H = -\sum_i p_i \log p_i

Thus we have
    \langle E \rangle - F = T H, \qquad F = \langle E \rangle - T H

Consider two systems A and A' in thermal contact, with entropy changes \Delta H and \Delta H'.
The total entropy tends to increase, with

    \Delta H + \Delta H' \ge 0.

The free energy of the system, F, tends to decrease and becomes a minimum in an equilibrium
situation. The resulting probability distribution is the Gibbs distribution
(the principle of minimal free energy).

Nature likes to find a physical system with minimum free energy.
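The identity F = ⟨E⟩ − TH is easy to verify numerically; the following quick check is my own illustration (assuming NumPy), not part of the notes:

```python
import numpy as np

T = 0.7
E = np.array([0.0, 1.0, 2.0, 5.0])          # state energies (k_B = 1)
Z = np.sum(np.exp(-E / T))                   # partition function
p = np.exp(-E / T) / Z                       # Gibbs distribution

F = -T * np.log(Z)                           # Helmholtz free energy
avg_E = np.sum(p * E)                        # <E>
H = -np.sum(p * np.log(p))                   # entropy

print(np.isclose(F, avg_E - T * H))          # True: F = <E> - T H
```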


11.3 Markov Chains (1/10)

Markov property:

    P(X_{n+1} = x_{n+1} \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)

Transition probability from state i at time n to state j at time n+1:

    p_{ij} = P(X_{n+1} = j \mid X_n = i),
    \qquad p_{ij} \ge 0 \text{ for all } i, j, \quad \sum_j p_{ij} = 1 \text{ for all } i

If the transition probabilities are fixed, the Markov chain is homogeneous. For a system with a
finite number of possible states K, the transition probabilities constitute a K-by-K matrix
(stochastic matrix):

    P = \begin{pmatrix} p_{11} & \cdots & p_{1K} \\ \vdots & \ddots & \vdots \\ p_{K1} & \cdots & p_{KK} \end{pmatrix}


11.3 Markov Chains (2/10)


Generalization to the m-step transition probability:

    p_{ij}^{(m)} = P(X_{n+m} = x_j \mid X_n = x_i), \qquad m = 1, 2, \ldots

    p_{ij}^{(m+1)} = \sum_k p_{ik}^{(m)} p_{kj}, \qquad m = 1, 2, \ldots, \quad p_{ik}^{(1)} = p_{ik}

We can generalize further to the Chapman-Kolmogorov identity:

    p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)}, \qquad m, n = 1, 2, \ldots
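Since the m-step transition probabilities are just the entries of the matrix power P^m, the Chapman-Kolmogorov identity amounts to P^{m+n} = P^m P^n; a small check of my own (assuming NumPy):

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])              # a stochastic matrix (rows sum to 1)

m, n = 2, 3
P_m = np.linalg.matrix_power(P, m)        # p_ij^(m)
P_n = np.linalg.matrix_power(P, n)        # p_ij^(n)
P_mn = np.linalg.matrix_power(P, m + n)   # p_ij^(m+n)

# Chapman-Kolmogorov: p_ij^(m+n) = sum_k p_ik^(m) p_kj^(n)
print(np.allclose(P_mn, P_m @ P_n))       # True
```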


11.3 Markov Chains (3/10)


Properties of Markov chains

Recurrent: p_i = P(ever returning to state i) = 1
Transient: p_i < 1

Periodic: a recurrent state i has period d if the states can be grouped into d disjoint subsets
S_1, ..., S_d such that, whenever i \in S_k and p_{ij} > 0,
    j \in S_{k+1}  for k = 1, ..., d-1,
    j \in S_1      for k = d.
Aperiodic: a recurrent state whose period is d = 1.

Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j.
Communicate: states i and j communicate if each is accessible from the other.
If two states communicate with each other, they belong to the same class.
If all the states constitute a single class, the Markov chain is indecomposable, or irreducible.


11.3 Markov Chains (4/10)

Figure 11.1: A periodic recurrent Markov chain with d = 3.


11.3 Markov Chains (5/10)


Ergodic Markov chains

Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the
chain in state i corresponds to the steady-state probability \pi_i.

v_i(k) : proportion of time spent in state i after k returns,

    v_i(k) = \frac{k}{\sum_{\ell=1}^{k} T_i(\ell)}

    \lim_{k \to \infty} v_i(k) = \pi_i, \qquad i = 1, 2, \ldots, K


11.3 Markov Chains (6/10)


Convergence to stationary distributions

Consider an ergodic Markov chain with stochastic matrix P.

\pi^{(n-1)} : state distribution vector of the chain at time n-1. The state distribution vector
at time n is

    \pi^{(n)} = \pi^{(n-1)} P

By iteration we obtain

    \pi^{(n)} = \pi^{(n-1)} P = \pi^{(n-2)} P^2 = \pi^{(n-3)} P^3 = \cdots,
    \qquad \pi^{(n)} = \pi^{(0)} P^n,

where \pi^{(0)} is the initial value. Moreover,

    \lim_{n \to \infty} P^n =
    \begin{pmatrix} \pi_1 & \cdots & \pi_K \\ \vdots & & \vdots \\ \pi_1 & \cdots & \pi_K \end{pmatrix}
    = \begin{pmatrix} \boldsymbol{\pi} \\ \vdots \\ \boldsymbol{\pi} \end{pmatrix}

Ergodic theorem:
1. \lim_{n \to \infty} p_{ij}^{(n)} = \pi_j for all i
2. \pi_j > 0 for all j
3. \sum_{j=1}^{K} \pi_j = 1
4. \pi_j = \sum_{i=1}^{K} \pi_i p_{ij} for j = 1, 2, \ldots, K


11.3 Markov Chains (7/10)

Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x1 and x2 may be identified as up-to-date and behind, respectively.


    P = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix},
    \qquad \pi^{(0)} = \begin{pmatrix} 1/6 & 5/6 \end{pmatrix}

    \pi^{(1)} = \pi^{(0)} P
              = \begin{pmatrix} 1/6 & 5/6 \end{pmatrix}
                \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix}
              = \begin{pmatrix} 11/24 & 13/24 \end{pmatrix}

    P^{(2)} = \begin{pmatrix} 0.4375 & 0.5625 \\ 0.3750 & 0.6250 \end{pmatrix},
    \quad
    P^{(3)} = \begin{pmatrix} 0.3906 & 0.6094 \\ 0.4063 & 0.5938 \end{pmatrix},
    \quad
    P^{(4)} = \begin{pmatrix} 0.4023 & 0.5977 \\ 0.3984 & 0.6016 \end{pmatrix}

As n increases, the rows of P^{(n)} become identical and converge to the stationary distribution
(0.4, 0.6) (to four decimal places by n = 8).
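A few lines of NumPy (my own sketch, not part of the notes) reproduce the iteration above and show the rows of P^n approaching the stationary distribution (0.4, 0.6):

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.50, 0.50]])               # stochastic matrix of Example 1

pi0 = np.array([1/6, 5/6])                 # initial state distribution
print(pi0 @ P)                             # pi(1) = [11/24, 13/24] ~ [0.4583, 0.5417]

for n in (2, 3, 4, 8):
    print(n, np.linalg.matrix_power(P, n).round(4))
# The rows of P^n converge to the stationary distribution (0.4, 0.6).
```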


11.3 Markov Chains (8/10)

Figure 11.3: State-transition diagram of Markov chain for Example 2.


    P = \begin{pmatrix} 0 & 0 & 1 \\ 1/3 & 1/6 & 1/2 \\ 3/4 & 1/4 & 0 \end{pmatrix}

Applying \pi_j = \sum_{i=1}^{K} \pi_i p_{ij}:

    \pi_1 = \tfrac{1}{3}\pi_2 + \tfrac{3}{4}\pi_3
    \pi_2 = \tfrac{1}{6}\pi_2 + \tfrac{1}{4}\pi_3
    \pi_3 = \pi_1 + \tfrac{1}{2}\pi_2

Together with the normalization \sum_j \pi_j = 1, this gives

    \pi_1 = 0.3953, \quad \pi_2 = 0.1395, \quad \pi_3 = 0.4652
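These values can be checked by solving π = πP together with the normalization Σ_j π_j = 1 as a small linear system; a sketch of mine, assuming NumPy:

```python
import numpy as np

P = np.array([[0.0,  0.0,  1.0],
              [1/3,  1/6,  1/2],
              [3/4,  1/4,  0.0]])               # stochastic matrix of Example 2

K = P.shape[0]
# Solve pi (P - I) = 0 subject to sum(pi) = 1:
A = np.vstack([P.T - np.eye(K), np.ones(K)])    # append the normalization row
b = np.concatenate([np.zeros(K), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi.round(4))      # [0.3953, 0.1395, 0.4651] -- matches the values above, up to rounding
```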


11.3 Markov Chains (9/10)

Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.


11.3 Markov Chains (10/10)


Principle of detailed balance:
At thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of
occurrence of the inverse transition,

    \pi_i p_{ij} = \pi_j p_{ji}

Application: detailed balance implies that \pi is a stationary distribution:

    \sum_{i=1}^{K} \pi_i p_{ij}
        = \sum_{i=1}^{K} \left( \frac{\pi_i p_{ij}}{\pi_j} \right) \pi_j
        = \sum_{i=1}^{K} \left( \frac{\pi_j p_{ji}}{\pi_j} \right) \pi_j
          \qquad (\text{detailed balance: } \pi_i p_{ij} = \pi_j p_{ji})
        = \sum_{i=1}^{K} p_{ji} \, \pi_j
        = \pi_j
          \qquad (\text{since } \sum_{i=1}^{K} p_{ji} = 1)
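As a small illustration (my own, assuming NumPy): for a chain that satisfies detailed balance, such as the birth-death chain below, the stationarity of π follows exactly as in the derivation above:

```python
import numpy as np

# A birth-death chain on 3 states; such chains satisfy detailed balance.
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.6, 0.4]])

# Stationary distribution via the leading left eigenvector of P.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Detailed balance pi_i p_ij = pi_j p_ji holds pairwise ...
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))   # True
# ... and it implies stationarity, pi P = pi:
print(np.allclose(pi @ P, pi))                              # True
```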


11.4 Metropolis Algorithm (1/3)


Metropolis algorithm
• A stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium.
• A modified Monte Carlo method; a Markov chain Monte Carlo (MCMC) method.

Algorithm (Metropolis)
1. Given the current state X_n = x_i, randomly generate a new (candidate) state x_j.
2. Compute the energy difference ΔE = E(x_j) − E(x_i).
3. If ΔE < 0, then X_{n+1} = x_j;
   else (ΔE ≥ 0) {
       select a random number ξ ~ U[0, 1];
       if ξ < exp(−ΔE / T), then X_{n+1} = x_j (accept);
       else X_{n+1} = x_i (reject).
   }
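A minimal runnable version of the algorithm (my own sketch, assuming NumPy), sampling from the Gibbs distribution over a small set of discrete states with a symmetric uniform proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(energies, T, n_steps=100_000):
    """Sample states i with probability proportional to exp(-E_i / T)."""
    K = len(energies)
    state = rng.integers(K)                      # arbitrary initial state x_i
    counts = np.zeros(K)
    for _ in range(n_steps):
        proposal = rng.integers(K)               # symmetric proposal: tau_ij = tau_ji
        dE = energies[proposal] - energies[state]
        # Accept if dE < 0, otherwise accept with probability exp(-dE / T).
        if dE < 0 or rng.random() < np.exp(-dE / T):
            state = proposal
        counts[state] += 1
    return counts / n_steps

E = np.array([0.0, 1.0, 2.0])
T = 1.0
print(metropolis(E, T).round(3))                           # empirical frequencies
print((np.exp(-E / T) / np.exp(-E / T).sum()).round(3))    # target Gibbs distribution
```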


11.4 Metropolis Algorithm (2/3)


Choice of transition probabilities

Proposed set of transition probabilities \tau_{ij}:
1. \tau_{ij} > 0 for all i, j   (nonnegativity)
2. \sum_j \tau_{ij} = 1 for all i   (normalization)
3. \tau_{ij} = \tau_{ji} for all i, j   (symmetry)

Desired set of transition probabilities (for j \ne i):

    p_{ij} = \begin{cases}
        \tau_{ij} \left( \dfrac{\pi_j}{\pi_i} \right) & \text{for } \pi_j / \pi_i < 1 \\[4pt]
        \tau_{ij} & \text{for } \pi_j / \pi_i \ge 1
    \end{cases}

    p_{ii} = \tau_{ii} + \sum_{j \ne i} \tau_{ij} (1 - \alpha_{ij})
           = 1 - \sum_{j \ne i} \alpha_{ij} \tau_{ij}

with the moving probability

    \alpha_{ij} = \min\left(1, \frac{\pi_j}{\pi_i}\right)


11.4 Metropolis Algorithm (3/3)


How do we choose the ratio \pi_j / \pi_i ?

We choose the probability distribution to which we want the Markov chain to converge to be a
Gibbs distribution,

    \pi_j = \frac{1}{Z} \exp\left(-\frac{E_j}{T}\right),
    \qquad \frac{\pi_j}{\pi_i} = \exp\left(-\frac{\Delta E}{T}\right),
    \qquad \Delta E = E_j - E_i,

so the ratio can be computed without knowing the partition function Z.

Proof of detailed balance:

Case 1: \Delta E < 0 (so \pi_j / \pi_i > 1).
    \pi_i p_{ij} = \pi_i \tau_{ij} = \pi_i \tau_{ji}
    \pi_j p_{ji} = \pi_j \left( \frac{\pi_i}{\pi_j} \tau_{ji} \right) = \pi_i \tau_{ji}

Case 2: \Delta E > 0 (so \pi_j / \pi_i < 1).
    \pi_i p_{ij} = \pi_i \left( \frac{\pi_j}{\pi_i} \tau_{ij} \right) = \pi_j \tau_{ij} = \pi_j \tau_{ji}
    \pi_j p_{ji} = \pi_j \tau_{ji}

In both cases, \pi_i p_{ij} = \pi_j p_{ji}.


11.5 Simulated Annealing (1/3)


Simulated annealing
• A stochastic relaxation technique for solving optimization problems.
• Improves the computational efficiency of the Metropolis algorithm.
• Makes random moves on the energy surface.
• Operates a stochastic system at a high temperature (where convergence to equilibrium is fast)
  and then iteratively lowers the temperature (at T = 0, the Markov chain collapses onto the
  global minima).

Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium
   distribution at each new temperature in the schedule, using the final state of the system at
   the previous temperature as the starting point for the new temperature.

    F = \langle E \rangle - T H, \qquad \lim_{T \to 0} F = \langle E \rangle


11.5 Simulated Annealing (2/3)


1. Initial value of the temperature. The initial value T_0 of the temperature is chosen high
   enough to ensure that virtually all proposed transitions are accepted by the
   simulated-annealing algorithm.

2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the
   changes made in the value of the temperature are small. The decrement function is

       T_k = \alpha T_{k-1}, \qquad k = 1, 2, \ldots, K,

   where \alpha is a constant smaller than, but close to, unity; typical values of \alpha lie
   between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are
   10 accepted transitions per experiment, on average.

3. Final value of the temperature. The system is frozen and annealing stops if the desired number
   of acceptances is not achieved at three successive temperatures.
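Putting the schedule and the Metropolis rule together gives the usual annealing loop; the sketch below is mine (assuming NumPy), and the 1-D energy function and Gaussian proposal are placeholders chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    """Illustrative 1-D energy surface with many local minima."""
    return x**2 + 10.0 * np.sin(3.0 * x)

def simulated_annealing(T0=10.0, alpha=0.9, steps_per_T=200, n_temps=50):
    x = rng.uniform(-5, 5)                       # random initial state
    T = T0                                       # initial (high) temperature
    for _ in range(n_temps):
        for _ in range(steps_per_T):
            x_new = x + rng.normal(scale=0.5)    # random move on the energy surface
            dE = energy(x_new) - energy(x)
            if dE < 0 or rng.random() < np.exp(-dE / T):
                x = x_new                        # Metropolis acceptance rule
        T *= alpha                               # exponential cooling: T_k = alpha * T_{k-1}
    return x

x_min = simulated_annealing()
print(x_min, energy(x_min))
```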


11.5 Simulated Annealing (3/3)

Simulated Annealing for Combinatorial Optimization


11.6 Gibbs Sampling (1/2)


Gibbs sampling
An iterative adaptive scheme that generates a value from the conditional distribution of one
component of the random vector X at a time, rather than sampling all of the variables at once.

X = (X_1, X_2, ..., X_K) : a random vector of K components.
Assume we know the conditionals P(X_k | X_{-k}), where X_{-k} = (X_1, ..., X_{k-1}, X_{k+1}, ..., X_K).

Gibbs sampling algorithm (Gibbs sampler)
1. Initialize x_1(0), x_2(0), ..., x_K(0).
2. At iteration i (shown here for i = 1), sample each component in turn from its conditional,
   given the most recent values of all the other components:
       x_1(1) ~ P(X_1 | x_2(0), x_3(0), x_4(0), ..., x_K(0))
       x_2(1) ~ P(X_2 | x_1(1), x_3(0), x_4(0), ..., x_K(0))
       x_3(1) ~ P(X_3 | x_1(1), x_2(1), x_4(0), ..., x_K(0))
       ...
       x_k(1) ~ P(X_k | x_1(1), x_2(1), ..., x_{k-1}(1), x_{k+1}(0), ..., x_K(0))
       ...
       x_K(1) ~ P(X_K | x_1(1), x_2(1), ..., x_{K-1}(1))
3. If the termination condition is not met, set i ← i + 1 and go to step 2.
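To make the scheme concrete, here is a small Gibbs sampler of my own (assuming NumPy) for a two-component example, a bivariate Gaussian with correlation ρ, whose full conditionals are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                  # correlation of the target bivariate Gaussian

def gibbs_sampler(n_iters=10_000):
    x1, x2 = 0.0, 0.0                      # step 1: initialize x1(0), x2(0)
    samples = np.empty((n_iters, 2))
    for i in range(n_iters):               # step 2, repeated via step 3
        # Sample each component from its conditional given the other:
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # P(X1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # P(X2 | x1)
        samples[i] = (x1, x2)
    return samples

s = gibbs_sampler()
print(np.corrcoef(s.T)[0, 1])              # close to rho = 0.8
```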


11.6 Gibbs Sampling (2/2)


1. Convergence theorem. The random variable X_k(n) converges in distribution to the true
   probability distribution of X_k for k = 1, 2, ..., K as n approaches infinity; that is,

       \lim_{n \to \infty} P(X_k(n) \le x \mid x_k(0)) = P_{X_k}(x) \quad \text{for } k = 1, 2, \ldots, K,

   where P_{X_k}(x) is the marginal cumulative distribution function of X_k.

2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables
   X_1(n), X_2(n), ..., X_K(n) converges to the true joint cumulative distribution of
   X_1, X_2, ..., X_K at a geometric rate in n.

3. Ergodic theorem. For any measurable function g of the random variables X_1, X_2, ..., X_K
   whose expectation exists, we have

       \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g(X_1(i), X_2(i), \ldots, X_K(i))
           = E[g(X_1, X_2, \ldots, X_K)]

   with probability 1 (i.e., almost surely).


11.7 Boltzmann Machine (1/5)

Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons,
and L is the number of hidden neurons. The distinguishing features of the machine are:
1. The connections between the visible and hidden neurons are symmetric.
2. The symmetric connections are extended to the visible and hidden neurons.


Boltzmann machine (BM)
A stochastic machine consisting of stochastic neurons with symmetric synaptic connections.

x : state vector of the BM
w_{ji} : synaptic connection from neuron i to neuron j

Structure (weights):
    w_{ji} = w_{ij} for all i, j
    w_{ii} = 0 for all i

Energy:

    E(\mathbf{x}) = -\frac{1}{2} \sum_i \sum_{j \ne i} w_{ji} x_i x_j

Probability of a state:

    P(\mathbf{X} = \mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{E(\mathbf{x})}{T}\right)


11.7 Boltzmann Machine (2/5)


Consider three events:
    A : X_j = x_j
    B : \{X_i = x_i\}_{i=1, i \ne j}^{K}
    C : \{X_i = x_i\}_{i=1}^{K}

The joint event B excludes A, and the joint event C includes both A and B.

    P(C) = P(A, B) = \frac{1}{Z} \exp\left( \frac{1}{2T} \sum_i \sum_{j \ne i} w_{ji} x_i x_j \right)

    P(B) = \sum_{A} P(A, B) = \frac{1}{Z} \sum_{x_j} \exp\left( \frac{1}{2T} \sum_i \sum_{j \ne i} w_{ji} x_i x_j \right)

The component of the exponent involving x_j is

    \frac{x_j}{2T} \sum_{i \ne j} w_{ji} x_i

Hence

    P(A \mid B) = \frac{P(A, B)}{P(B)}
                = \frac{1}{1 + \exp\left( -\frac{x_j}{T} \sum_{i \ne j} w_{ji} x_i \right)}

that is,

    P\left( X_j = x \mid \{X_i = x_i\}_{i=1, i \ne j}^{K} \right)
        = \varphi\left( \frac{x}{T} \sum_{i \ne j}^{K} w_{ji} x_i \right),
    \qquad \varphi(v) = \frac{1}{1 + \exp(-v)}
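In simulation, this conditional distribution becomes a single stochastic-neuron update. The sketch below is mine (assuming NumPy and bipolar states x_i ∈ {−1, +1}), not the textbook's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_neuron(x, W, j, T=1.0):
    """Resample neuron j of a Boltzmann machine, given all the other neurons.

    x : state vector with entries in {-1, +1}
    W : symmetric weight matrix with zero diagonal (w_ji = w_ij, w_ii = 0)
    """
    v = (W[j] @ x) / T                     # (1/T) * sum_{i != j} w_ji x_i  (w_jj = 0)
    p_plus = 1.0 / (1.0 + np.exp(-v))      # P(X_j = +1 | rest) = phi(v)
    x[j] = 1 if rng.random() < p_plus else -1
    return x

# Example: one sweep of Gibbs updates over a small 4-neuron machine.
K = 4
W = rng.normal(size=(K, K)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
x = rng.choice([-1, 1], size=K)
for j in range(K):
    x = update_neuron(x, W, j)
print(x)
```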


11.7 Boltzmann Machine (3/5)

Figure 11.6: Sigmoid-shaped function P(v).


Log-likelihood of the training sample ℑ:

    L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha \in ℑ} P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)
                  = \sum_{\mathbf{x}_\alpha \in ℑ} \log P(\mathbf{X}_\alpha = \mathbf{x}_\alpha)

1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the
   direct influence of the training sample ℑ).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with
   no environmental input.


11.7 Boltzmann Machine (4/5)


x_α : the state of the visible neurons (a subset of x)
x_β : the state of the hidden neurons (a subset of x)

Probability of a visible state:

    P(\mathbf{X}_\alpha = \mathbf{x}_\alpha) = \frac{1}{Z} \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right),
    \qquad Z = \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right)

Log-likelihood function, given the training data ℑ:

    L(\mathbf{w}) = \sum_{\mathbf{x}_\alpha \in ℑ}
        \left[ \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right)
             - \log \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \right]

Derivative of the log-likelihood function:

    \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T} \sum_{\mathbf{x}_\alpha \in ℑ}
        \left[ \sum_{\mathbf{x}_\beta} P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha) \, x_j x_i
             - \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \, x_j x_i \right]


11.7 Boltzmann Machine (5/5)


Mean firing rate (correlation) in the positive (clamped) phase:

    \rho_{ji}^{+} = \langle x_j x_i \rangle^{+}
                  = \sum_{\mathbf{x}_\alpha \in ℑ} \sum_{\mathbf{x}_\beta}
                    P(\mathbf{X}_\beta = \mathbf{x}_\beta \mid \mathbf{X}_\alpha = \mathbf{x}_\alpha) \, x_j x_i

Mean firing rate (correlation) in the negative (free-running) phase:

    \rho_{ji}^{-} = \langle x_j x_i \rangle^{-}
                  = \sum_{\mathbf{x}_\alpha \in ℑ} \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \, x_j x_i

Thus, we may write

    \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{1}{T} \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)

Gradient ascent to maximize L(\mathbf{w}) gives the Boltzmann machine learning rule:

    \Delta w_{ji} = \epsilon \frac{\partial L(\mathbf{w})}{\partial w_{ji}}
                  = \eta \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right),
    \qquad \eta = \frac{\epsilon}{T}
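Given estimates of the clamped and free-running correlations (obtained, for example, by Gibbs sampling the network in the two phases), one learning step is simply a scaled difference of the two correlation matrices; a minimal sketch of mine, assuming NumPy:

```python
import numpy as np

def boltzmann_update(W, rho_plus, rho_minus, eps=0.1, T=1.0):
    """One step of the Boltzmann machine learning rule.

    rho_plus  : estimated <x_j x_i> with the visible neurons clamped to the data
    rho_minus : estimated <x_j x_i> with the network running freely
    """
    eta = eps / T                         # learning rate eta = eps / T
    dW = eta * (rho_plus - rho_minus)     # delta w_ji = eta (rho+_ji - rho-_ji)
    W = W + dW
    np.fill_diagonal(W, 0.0)              # keep w_ii = 0
    return W
```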


11.8 Logistic Belief Nets

Figure 11.7: Directed (logistic) belief network.

A stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic
connections.

Parents of node j:
    pa(X_j) \subseteq \{ X_1, X_2, \ldots, X_{j-1} \}

Conditional probability:
    P(X_j = x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1}) = P(X_j = x_j \mid pa(X_j))

Calculation of conditional probabilities:
1. w_{ji} = 0 for all X_i \notin pa(X_j)
2. w_{ji} = 0 for i \ge j   (the network is acyclic)

Weight update rule:
    \Delta w_{ji} = \eta \frac{\partial L(\mathbf{w})}{\partial w_{ji}}


11.9 Deep Belief Nets (1/4)

Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with the architecture of Fig. 11.5, we see that, unlike the Boltzmann machine, the RBM has no connections among the visible neurons and none among the hidden neurons.


Maximum-likelihood learning in a restricted Boltzmann machine (RBM)

Sequential pre-training:
1. Update the hidden states h in parallel, given the visible states x.
2. Do the same, but in reverse: update the visible states x in parallel, given the hidden states h.

Maximum-likelihood learning:

    \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}


11.9 Deep Belief Nets (2/4)

Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.


Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.


11.9 Deep Belief Nets (3/4)

Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.


11.9 Deep Belief Nets (4/4)

Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.


11.10 Deterministic Annealing (1/5)


Deterministic annealing (DA)
Incorporates randomness into the energy function, which is then deterministically optimized at a
sequence of decreasing temperatures (cf. simulated annealing: random moves on the energy surface).

Clustering via deterministic annealing
    x : source (input) vector
    y : reconstruction (output) vector

Distortion measure:

    d(\mathbf{x}, \mathbf{y}) = \| \mathbf{x} - \mathbf{y} \|^2

Expected distortion:

    D = \sum_{\mathbf{x}} \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \, d(\mathbf{x}, \mathbf{y})
      = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \, d(\mathbf{x}, \mathbf{y})

Probability of the joint event:

    P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) = P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \, P(\mathbf{X} = \mathbf{x}),

where P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) is the association probability.


11.10 Deterministic Annealing (2/5)

Table 11.2


Entropy as a randomness measure:

    H(\mathbf{X}, \mathbf{Y}) = -\sum_{\mathbf{x}} \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \log P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y})

Constrained optimization of D as minimization of the Lagrangian:

    F = D - T H

    H(\mathbf{X}, \mathbf{Y}) = H(\mathbf{X})  (source entropy)  +  H(\mathbf{Y} \mid \mathbf{X})  (conditional entropy)

    H(\mathbf{Y} \mid \mathbf{X}) = -\sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \sum_{\mathbf{y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \log P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})

Minimizing F with respect to the association probabilities gives the Gibbs form

    P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \frac{1}{Z_x} \exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right),
    \qquad Z_x = \sum_{\mathbf{y}} \exp\left(-\frac{d(\mathbf{x}, \mathbf{y})}{T}\right)


11.10 Deterministic Annealing (3/5)


    F^{*} = \min_{P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})} F
          = -T \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}) \log Z_x

Setting

    \frac{\partial F^{*}}{\partial \mathbf{y}}
        = \sum_{\mathbf{x}} P(\mathbf{X} = \mathbf{x}, \mathbf{Y} = \mathbf{y}) \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0
        \quad \text{for all } \mathbf{y} \in \Upsilon,

the minimizing condition is

    \frac{1}{N} \sum_{\mathbf{x}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) \frac{\partial}{\partial \mathbf{y}} d(\mathbf{x}, \mathbf{y}) = 0
        \quad \text{for all } \mathbf{y} \in \Upsilon.

The deterministic-annealing algorithm consists of minimizing the Lagrangian F^{*} with respect to
the code vectors at a high value of the temperature T and then tracking the minimum while the
temperature T is lowered.
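A compact sketch of this procedure for squared-error clustering (my own, assuming NumPy): at each temperature, alternate between computing the Gibbs association probabilities and re-estimating the code vectors as the resulting weighted centroids, then lower T:

```python
import numpy as np

rng = np.random.default_rng(0)

def da_clustering(X, n_codes=3, T0=5.0, alpha=0.9, n_temps=40, inner=20):
    """Deterministic annealing for squared-error clustering (a sketch)."""
    Y = X[rng.choice(len(X), n_codes, replace=False)].copy()   # initial code vectors
    T = T0
    for _ in range(n_temps):
        for _ in range(inner):
            # Association probabilities: Gibbs distribution over code vectors.
            d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # d(x, y) = ||x - y||^2
            logits = -d / T
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)                   # P(Y = y | X = x)
            # Minimizing condition for squared distortion: weighted centroids.
            Y = (P.T @ X) / P.sum(axis=0)[:, None]
        T *= alpha                                              # lower the temperature
    return Y

X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 3))])
print(da_clustering(X).round(2))          # roughly the three cluster centers
```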


11.10 Deterministic Annealing (4/5)

Figure 11.13: Clustering at various phases. The lines are equiprobability contours, p = ½ in (b) and p = ⅓ elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605). Here B = 1/T denotes the inverse temperature.


11.10 Deterministic Annealing (5/5)

Figure 11.14: Phase diagram for the Case Study in deterministic annealing. The number of effective clusters is shown for each phase.

(B = 1/T denotes the inverse temperature.)


11.11 Analogy of DA with EM (1/2)


Suppose we view the association probability P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})
as the expected value of a random binary variable V_{xy} defined as

    V_{xy} = \begin{cases}
        1 & \text{if the source vector } \mathbf{x} \text{ is assigned to code vector } \mathbf{y} \\
        0 & \text{otherwise.}
    \end{cases}

Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}).
2. Step 2 of DA (= M-step of EM): optimize the distortion measure d(\mathbf{x}, \mathbf{y}).


11.11 Analogy of DA with EM (2/2)


r : complete data, including the missing data z
d = d(r) : incomplete data

Conditional pdf of r, given the parameter vector \theta:

    p_D(\mathbf{d} \mid \theta) = \int_{ℜ(\mathbf{d})} p_c(\mathbf{r} \mid \theta) \, d\mathbf{r}

ℜ(d) : the subspace of ℜ determined by d = d(r)

Incomplete-data log-likelihood function:  L(\theta) = \log p_D(\mathbf{d} \mid \theta)
Complete-data log-likelihood function:    L_c(\theta) = \log p_c(\mathbf{r} \mid \theta)

Expectation-maximization (EM) algorithm
\hat{\theta}(n) : value of \theta at iteration n of EM
1. E-step:  Q(\theta, \hat{\theta}(n)) = E_{\hat{\theta}(n)} \left[ L_c(\theta) \right]
2. M-step:  \hat{\theta}(n+1) = \arg\max_{\theta} Q(\theta, \hat{\theta}(n))

After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not
decreased:

    L(\hat{\theta}(n+1)) \ge L(\hat{\theta}(n)) \quad \text{for } n = 0, 1, 2, \ldots


Summary and Discussion

• Statistical mechanics as a mathematical basis for the formulation of stochastic simulation,
  optimization, and learning
  1. Metropolis algorithm
  2. Simulated annealing
  3. Gibbs sampling

• Stochastic learning machines
  1. (Classical) Boltzmann machine
  2. Restricted Boltzmann machine (RBM)
  3. Deep belief nets (DBN)

• Deterministic annealing (DA)
  1. For optimization: connection to simulated annealing (SA)
  2. For clustering: connection to expectation-maximization (EM)
