slides_lecture6


    Simulated Annealing

input : (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; T_start, T_stop ∈ R
output: w

begin
    randomly initialize w
    T ← T_start
    repeat
        w' ← N(w)    // neighbor of w, e.g. obtained by adding Gaussian noise N(0, σ)
        if E(w') < E(w) then
            w ← w'
        else if exp(−(E(w') − E(w)) / T) > rand[0, 1) then
            w ← w'
        decrease(T)
    until T < T_stop
    return w
end
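A minimal sketch of this loop in Python (the quadratic energy used in the usage line, the noise scale sigma, and the geometric cooling factor alpha are illustrative assumptions, not part of the slide):

import math
import random

def simulated_annealing(E, w, T_start, T_stop, sigma=0.1, alpha=0.95):
    """Minimize E(w) by annealed random search; w is a list of floats."""
    T = T_start
    while T >= T_stop:
        # propose a neighbor by adding Gaussian noise to every coordinate
        w_new = [wi + random.gauss(0.0, sigma) for wi in w]
        dE = E(w_new) - E(w)
        # always accept improvements; accept worse moves with probability exp(-dE/T)
        if dE < 0 or math.exp(-dE / T) > random.random():
            w = w_new
        T *= alpha  # decrease the temperature
    return w

# usage: minimize a simple quadratic energy
w_opt = simulated_annealing(lambda w: sum(wi**2 for wi in w), [2.0, -3.0], 10.0, 1e-3)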


    Continuous Hopfield Network

Let us consider our previously defined Hopfield network (identical architecture and learning rule), however with the following activity rule:

S_i = \tanh\left( \frac{1}{T} \sum_j w_{ij} S_j \right)

Start with a large (temperature) value of T and decrease it by some magnitude whenever a unit is updated (deterministic simulated annealing).

This type of Hopfield network can approximate the probability distribution

P(x|W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)
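A small Python sketch of the annealed tanh update (the cooling factor, the start/stop temperatures, and the two-unit example weights are illustrative assumptions):

import numpy as np

def anneal_hopfield(W, s, T_start=5.0, T_stop=0.05, alpha=0.98):
    """Continuous Hopfield updates S_i = tanh((1/T) * sum_j W_ij S_j),
    lowering T after every unit update (deterministic simulated annealing)."""
    s = s.astype(float).copy()
    T = T_start
    while T >= T_stop:
        i = np.random.randint(len(s))     # pick a unit to update
        s[i] = np.tanh(W[i] @ s / T)      # annealed activity rule
        T *= alpha                        # decrease the temperature
    return s

# usage: two-unit network that favors the pattern (+1, -1)
W = np.array([[0.0, -1.0], [-1.0, 0.0]])
print(anneal_hopfield(W, np.array([0.1, 0.2])))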


    Continuous Hopfield Network

Z(W) = \sum_x \exp[-E(x)]   (sum over all possible states)

is the partition function and ensures that P(x|W) is a probability distribution.

Idea: construct a stochastic Hopfield network that implements the probability distribution P(x|W).

Learn a model that is capable of generating patterns from that unknown distribution.

Quantify (classify) seen and unseen patterns by means of probabilities.

If needed, we can generate more patterns (generative model).


    Boltzmann Machines

Given patterns \{x^{(n)}\}_{n=1}^N, we want to learn the weights such that the generative model

P(x|W) = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)

is well matched to those patterns. The states are updated according to the stochastic rule: set x_i = +1 with probability

\frac{1}{1 + \exp\left( -2 \sum_j w_{ij} x_j \right)},

else set x_i = -1.

Posterior probability of the weights given the data (Bayes' theorem):

P(W | \{x^{(n)}\}_{n=1}^N) = \frac{ \prod_{n=1}^N P(x^{(n)}|W) \, P(W) }{ P(\{x^{(n)}\}_{n=1}^N) }
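A minimal Python sketch of this stochastic update rule, i.e. repeated Gibbs sweeps over the units (the example weight matrix and the number of sweeps are illustrative assumptions):

import numpy as np

def gibbs_sweep(W, x):
    """One sweep of the stochastic rule: x_i = +1 with prob. 1/(1+exp(-2*sum_j W_ij x_j))."""
    for i in np.random.permutation(len(x)):
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * W[i] @ x))
        x[i] = 1.0 if np.random.rand() < p_plus else -1.0
    return x

# usage: sample states of a 3-unit machine
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.2],
              [-0.3, 0.2, 0.0]])
x = np.random.choice([-1.0, 1.0], size=3)
for _ in range(100):    # repeated sweeps approximate samples from P(x|W)
    x = gibbs_sweep(W, x)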


    Boltzmann Machines

Apply the maximum likelihood method to the first term in the numerator:

\ln \prod_{n=1}^N P(x^{(n)}|W) = \sum_{n=1}^N \left[ \frac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \right]

Taking the derivative of the log likelihood (note that W is symmetric, w_{ij} = w_{ji}) gives

\frac{\partial}{\partial w_{ij}} \, \frac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}

and

\frac{\partial}{\partial w_{ij}} \ln Z(W) = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\left( \frac{1}{2} x^T W x \right)
= \frac{1}{Z(W)} \sum_x \exp\left( \frac{1}{2} x^T W x \right) x_i x_j
= \sum_x x_i x_j \, P(x|W) = \langle x_i x_j \rangle_{P(x|W)}


    Boltzmann Machines (cont.)

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N | W) = \sum_{n=1}^N \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x|W)} \right]
= N \left[ \langle x_i x_j \rangle_{Data} - \langle x_i x_j \rangle_{P(x|W)} \right]

Empirical correlation between x_i and x_j:

\langle x_i x_j \rangle_{Data} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

\langle x_i x_j \rangle_{P(x|W)} \equiv \sum_x x_i x_j \, P(x|W)
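For very small networks both correlations, and hence the gradient, can be computed exactly by enumerating all states; the sketch below is an illustrative Python version (the example patterns and the learning rate are assumptions, not from the slides):

import itertools
import numpy as np

def log_likelihood_gradient(W, data):
    """Exact gradient N*(<x_i x_j>_Data - <x_i x_j>_P(x|W)) for +-1 states."""
    d = W.shape[0]
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    energies = 0.5 * np.einsum('si,ij,sj->s', states, W, states)   # (1/2) x^T W x per state
    p = np.exp(energies)
    p /= p.sum()                                                    # P(x|W) = exp(.)/Z(W)
    corr_model = np.einsum('s,si,sj->ij', p, states, states)        # <x_i x_j> under the model
    corr_data = data.T @ data / len(data)                           # <x_i x_j> under the data
    return len(data) * (corr_data - corr_model)

# usage: one gradient-ascent step on two patterns of a 3-unit machine
data = np.array([[1.0, 1.0, -1.0], [1.0, 1.0, 1.0]])
W = np.zeros((3, 3))
W += 0.01 * log_likelihood_gradient(W, data)
np.fill_diagonal(W, 0.0)    # keep zero self-connections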


Interpretation of Boltzmann Machine Learning

Illustrative description (MacKay's book, p. 523):

Awake state: measure the correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.

Sleep state: dream about the world using the generative model P(x|W) and measure the correlation between x_i and x_j in the model world. Use these correlations to determine a proportional decrease in the weights.

If the correlations in the dream world and in the real world match, then the two terms balance and the weights do not change.


    Boltzmann Machines with Hidden Units

To model higher-order correlations, hidden units are required.

x: states of the visible units,

h: states of the hidden units,

the generic state of a unit (either visible or hidden) is denoted by y_i, with y ≡ (x, h),

the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} ≡ (x^{(n)}, h).

The probability of a single pattern x^{(n)} given W is

P(x^{(n)}|W) = \sum_h P(x^{(n)}, h|W) = \sum_h \frac{1}{Z(W)} \exp\left( \frac{1}{2} y^{(n)T} W y^{(n)} \right)

where

Z(W) = \sum_{x,h} \exp\left( \frac{1}{2} y^T W y \right)
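A small illustrative Python sketch of this marginalisation over the hidden states (the ordering of visible units before hidden units and the example weights are assumptions):

import itertools
import numpy as np

def pattern_probability(W, x, n_hidden):
    """P(x|W) = sum_h exp(0.5*y^T W y) / Z(W), with y = (x, h) and +-1 hidden units."""
    def boltzmann_weight(y):
        return np.exp(0.5 * y @ W @ y)
    hidden_states = list(itertools.product([-1.0, 1.0], repeat=n_hidden))
    visible_states = list(itertools.product([-1.0, 1.0], repeat=len(x)))
    numerator = sum(boltzmann_weight(np.concatenate([x, h])) for h in hidden_states)
    Z = sum(boltzmann_weight(np.concatenate([v, h]))
            for v in visible_states for h in hidden_states)
    return numerator / Z

# usage: 2 visible + 1 hidden unit, units ordered as (visible..., hidden...)
W = np.array([[0.0, 0.4, 0.8],
              [0.4, 0.0, 0.8],
              [0.8, 0.8, 0.0]])
print(pattern_probability(W, np.array([1.0, 1.0]), n_hidden=1))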


    Boltzmann Machines with Hidden Units (cont.)

Applying the maximum likelihood method as before, one obtains

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N | W) = \sum_n \left[ \langle y_i y_j \rangle_{P(h|x^{(n)},W)} - \langle y_i y_j \rangle_{P(x,h|W)} \right]

(first term: visible units clamped to x^{(n)}; second term: free-running)

The term \langle y_i y_j \rangle_{P(h|x^{(n)},W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables sampling freely from their conditional distribution.

The term \langle y_i y_j \rangle_{P(x,h|W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.


    Boltzmann Machines with Input-Hidden-Output

The Boltzmann machine considered so far is a powerful stochastic Hopfield network with no ability to perform classification. Let us introduce visible input and output units:

x ≡ (x_i, x_o)

Note that a pattern x^{(n)} consists of an input and an output part, that is, x^{(n)} ≡ (x_i^{(n)}, x_o^{(n)}).

\sum_n \left[ \langle y_i y_j \rangle_{clamped\ to\ (x_i^{(n)},\, x_o^{(n)})} - \langle y_i y_j \rangle_{clamped\ to\ x_i^{(n)}} \right]


Boltzmann Machine Weight Updates

Combine gradient descent and simulated annealing to update the weights:

\Delta w_{ij} = \frac{\eta}{T} \left[ \langle y_i y_j \rangle_{clamped\ to\ (x_i^{(n)},\, x_o^{(n)})} - \langle y_i y_j \rangle_{clamped\ to\ x_i^{(n)}} \right]

High computational complexity:

present each pattern several times

anneal several times

Mean-field version of Boltzmann learning:

calculate approximations of the correlations [y_i y_j] entering the gradient (see the sketch below)
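A minimal sketch of such a mean-field approximation (an assumption about the concrete scheme, not taken from the slides): each correlation [y_i y_j] is approximated by the product m_i m_j of mean activations obtained from the deterministic fixed point m_i = tanh((1/T) Σ_j w_ij m_j).

import numpy as np

def mean_field_correlations(W, m, T, clamped=None, n_iter=100):
    """Approximate [y_i y_j] ~= m_i m_j via the fixed point m_i = tanh((1/T) sum_j W_ij m_j).
    `clamped` maps unit index -> fixed value (e.g. a clamped visible unit)."""
    clamped = clamped or {}
    m = m.astype(float).copy()
    for idx, val in clamped.items():
        m[idx] = val
    for _ in range(n_iter):
        for i in range(len(m)):
            if i in clamped:
                continue
            m[i] = np.tanh(W[i] @ m / T)    # deterministic mean-field update
    return np.outer(m, m)                    # [y_i y_j] ~= m_i * m_j

# usage: correlations with unit 0 clamped to +1, at temperature T = 1
W = np.array([[0.0, 0.5], [0.5, 0.0]])
print(mean_field_correlations(W, np.zeros(2), T=1.0, clamped={0: 1.0}))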


    Deterministic Boltzmann Learning

input : {x^(n)}_{n=1..N}; η, T_start, T_stop ∈ R
output: W

begin
    T ← T_start
    repeat
        randomly select a pattern from the sample {x^(n)}_{n=1..N}
        randomize states
        anneal the network with input and output clamped
        at final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + (η/T) ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < T_stop
    return W
end
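A compact Python sketch of this training loop, assuming the `mean_field_correlations` helper sketched two slides earlier for the annealing/correlation step and a geometric schedule for lowering T (both are assumptions; the slide itself leaves the annealing details open):

import numpy as np

def deterministic_boltzmann_learning(patterns, n_hidden, eta, T_start, T_stop, alpha=0.9):
    """patterns: list of (inputs, outputs) with +-1 entries; returns the learned weights W."""
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    n = n_in + n_out + n_hidden
    W = 0.01 * np.random.randn(n, n); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
    T = T_start
    while T >= T_stop:
        x_in, x_out = patterns[np.random.randint(len(patterns))]       # randomly select a pattern
        clamp_io = {i: v for i, v in enumerate(np.concatenate([x_in, x_out]))}
        clamp_i = {i: v for i, v in enumerate(x_in)}
        m0 = np.random.choice([-1.0, 1.0], size=n)                     # randomize states
        corr_io = mean_field_correlations(W, m0, T, clamped=clamp_io)  # input and output clamped
        corr_i = mean_field_correlations(W, m0, T, clamped=clamp_i)    # input clamped, output free
        W += (eta / T) * (corr_io - corr_i)                            # weight update
        np.fill_diagonal(W, 0.0)
        T *= alpha                                                     # assumed annealing schedule
    return W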