Simulated Annealing
input : (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; T_start, T_stop ∈ R
output: w

begin
    randomly initialize w
    T ← T_start
    repeat
        ŵ ← N(w)    // neighbor of w, e.g. obtained by adding Gaussian noise N(0, σ)
        if E(ŵ) < E(w) then
            w ← ŵ
        else if exp[ (E(w) − E(ŵ)) / T ] > rand[0, 1) then
            w ← ŵ
        decrease(T)
    until T < T_stop
    return w
end
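A minimal Python sketch of the procedure above; the energy function E, the noise scale sigma, and the geometric cooling schedule are illustrative assumptions, not prescribed by the slide.

```python
import numpy as np

def simulated_annealing(E, w0, T_start, T_stop, sigma=0.1, cooling=0.95, rng=None):
    """Minimize an energy E(w) following the pseudocode above (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w0, dtype=float)                          # initial (e.g. random) weights
    T = T_start
    while T >= T_stop:
        w_hat = w + rng.normal(0.0, sigma, size=w.shape)   # neighbor via Gaussian noise
        dE = E(w_hat) - E(w)
        # always accept improvements; accept worse moves with probability exp(-dE/T)
        if dE < 0 or np.exp(-dE / T) > rng.random():
            w = w_hat
        T *= cooling                                       # decrease(T)
    return w
```

For example, simulated_annealing(lambda w: np.sum(w**2), w0=np.ones(5), T_start=1.0, T_stop=1e-3) drives w toward the minimum of a simple quadratic energy.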
Continuous Hopfield Network
Let us consider our previously defined Hopfield network (identical architecture and learning rule), but with the following activity rule:

    S_i = \tanh\Big( \frac{1}{T} \sum_j w_{ij} S_j \Big)

Start with a large (temperature) value of T and decrease it by some amount whenever a unit is updated (deterministic simulated annealing).

This type of Hopfield network can approximate the probability distribution

    P(x|W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\Big( \frac{1}{2} x^T W x \Big)
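As a concrete illustration of the activity rule with a decreasing temperature, a small sketch; the initial state, cooling factor, and stopping temperature are assumed values.

```python
import numpy as np

def anneal_hopfield(W, S0, T_start=10.0, T_stop=0.1, cooling=0.99, rng=None):
    """Deterministic annealing of a continuous Hopfield network:
    repeatedly update a randomly chosen unit with S_i = tanh((1/T) sum_j w_ij S_j)."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.array(S0, dtype=float)
    T = T_start
    while T > T_stop:
        i = rng.integers(len(S))            # pick a unit to update
        S[i] = np.tanh(W[i] @ S / T)        # activity rule from the slide
        T *= cooling                        # lower T after every unit update
    return S
```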
Continuous Hopfield Network
    Z(W) = \sum_x \exp(-E(x))    (sum over all possible states)

is the partition function and ensures that P(x|W) is a probability distribution.

Idea: construct a stochastic Hopfield network that implements the probability distribution P(x|W).

- Learn a model that is capable of generating patterns from that unknown distribution.
- Quantify (classify) seen and unseen patterns by means of probabilities.
- If needed, we can generate more patterns (generative model).
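For intuition only, Z(W) and P(x|W) can be computed by brute force for a very small network by enumerating all states x ∈ {−1, +1}^K; this is a sketch of the definition, not a practical method, since the sum grows as 2^K.

```python
import itertools
import numpy as np

def partition_function(W):
    """Z(W) = sum over all x in {-1,+1}^K of exp(0.5 * x^T W x)  (small K only)."""
    K = W.shape[0]
    return sum(np.exp(0.5 * np.array(x) @ W @ np.array(x))
               for x in itertools.product([-1, 1], repeat=K))

def prob(x, W):
    """P(x|W) = exp(0.5 * x^T W x) / Z(W)."""
    x = np.asarray(x, dtype=float)
    return np.exp(0.5 * x @ W @ x) / partition_function(W)
```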
Boltzmann Machines
Given patterns {x^{(n)}}_{n=1}^N, we want to learn the weights such that the generative model

    P(x|W) = \frac{1}{Z(W)} \exp\Big( \frac{1}{2} x^T W x \Big)

is well matched to those patterns. The states are updated according to the stochastic rule: set x_i = +1 with probability

    \frac{1}{1 + \exp\big( -2 \sum_j w_{ij} x_j \big)}

otherwise set x_i = −1.

Posterior probability of the weights given the data (Bayes' theorem):

    P(W \mid \{x^{(n)}\}_{n=1}^N) = \frac{ \prod_{n=1}^N P(x^{(n)}|W) \, P(W) }{ P(\{x^{(n)}\}_{n=1}^N) }
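A sketch of the stochastic update rule above as one Gibbs sweep over all units; W is assumed symmetric with a zero diagonal, and the sequential sweep order is an illustrative choice.

```python
import numpy as np

def gibbs_sweep(W, x, rng=None):
    """One sweep of the stochastic Boltzmann machine update:
    each unit i is set to +1 with probability 1 / (1 + exp(-2 * sum_j w_ij x_j))."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x, dtype=float)
    for i in range(len(x)):
        a = W[i] @ x                                   # activation sum_j w_ij x_j
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * a))        # probability of x_i = +1
        x[i] = 1.0 if rng.random() < p_plus else -1.0
    return x
```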
Boltzmann Machines
Apply the maximum likelihood method to the first term in the numerator:

    \ln \prod_{n=1}^N P(x^{(n)}|W) = \sum_{n=1}^N \Big[ \frac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \Big]

Taking the derivative of the log likelihood (note that W is symmetric, w_{ij} = w_{ji}) gives

    \frac{\partial}{\partial w_{ij}} \frac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}

and

    \frac{\partial}{\partial w_{ij}} \ln Z(W)
        = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\Big( \frac{1}{2} x^T W x \Big)
        = \frac{1}{Z(W)} \sum_x \exp\Big( \frac{1}{2} x^T W x \Big) x_i x_j
        = \sum_x x_i x_j \, P(x|W) = \langle x_i x_j \rangle_{P(x|W)}
Boltzmann Machines (cont.)
    \frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N \mid W)
        = \sum_{n=1}^N \Big[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x|W)} \Big]
        = N \Big[ \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(x|W)} \Big]

Empirical correlation between x_i and x_j:

    \langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

    \langle x_i x_j \rangle_{P(x|W)} \equiv \sum_x x_i x_j \, P(x|W)
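A sketch of one gradient-ascent step for a small, fully visible Boltzmann machine, combining the two correlation terms; the model correlation is computed exactly by enumeration (feasible only for a handful of units), and the learning rate eta is an assumed parameter.

```python
import itertools
import numpy as np

def boltzmann_gradient_step(W, X, eta=0.01):
    """One step of gradient ascent on ln P({x^(n)} | W).
    X has shape (N, K) with entries in {-1, +1}; W is symmetric with zero diagonal."""
    N, K = X.shape
    # empirical correlations <x_i x_j>_Data
    corr_data = (X.T @ X) / N
    # model correlations <x_i x_j>_{P(x|W)} by exhaustive enumeration over all 2^K states
    states = np.array(list(itertools.product([-1, 1], repeat=K)), dtype=float)
    logp = 0.5 * np.einsum('si,ij,sj->s', states, W, states)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    corr_model = (states * p[:, None]).T @ states
    # gradient: N * (<x_i x_j>_Data - <x_i x_j>_model); keep the diagonal at zero
    W = W + eta * N * (corr_data - corr_model)
    np.fill_diagonal(W, 0.0)
    return W
```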
Interpretation of Boltzmann Machines Learning
Illustrative description (MacKay's book, p. 523):

- Awake state: measure the correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.
- Sleep state: dream about the world using the generative model P(x|W) and measure the correlation between x_i and x_j in the model world. Use these correlations to determine a proportional decrease in the weights.

If the correlations in the dream world and the real world match, the two terms balance and the weights do not change.
Boltzmann Machines with Hidden Units
To model higher-order correlations, hidden units are required.

- x: states of the visible units,
- h: states of the hidden units,
- the generic state of a unit (either visible or hidden) is denoted y_i, with y \equiv (x, h),
- the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} \equiv (x^{(n)}, h).

The likelihood of W given a single pattern x^{(n)} is

    P(x^{(n)}|W) = \sum_h P(x^{(n)}, h \mid W) = \sum_h \frac{1}{Z(W)} \exp\Big( \frac{1}{2} y^{(n)T} W y^{(n)} \Big)

where Z(W) = \sum_{x,h} \exp\big( \frac{1}{2} y^T W y \big).
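For a small network, the marginal P(x^(n)|W) can be written out directly by summing over all hidden configurations; a sketch assuming the joint weight matrix W orders the visible units first, then the hidden units.

```python
import itertools
import numpy as np

def marginal_prob(x_visible, W, n_hidden):
    """P(x|W) = sum_h exp(0.5 * y^T W y) / Z(W), where y = (x, h).
    Exhaustive, and therefore only feasible for small numbers of units."""
    x_visible = np.asarray(x_visible, dtype=float)
    V = len(x_visible)

    def unnorm(y):
        return np.exp(0.5 * y @ W @ y)

    # numerator: sum over hidden states with the visible units clamped to x
    num = sum(unnorm(np.concatenate([x_visible, np.asarray(h, dtype=float)]))
              for h in itertools.product([-1, 1], repeat=n_hidden))
    # partition function: sum over all joint states y = (x, h)
    Z = sum(unnorm(np.asarray(y, dtype=float))
            for y in itertools.product([-1, 1], repeat=V + n_hidden))
    return num / Z
```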
Boltzmann Machines with Hidden Units (cont.)
Applying the maximum likelihood method as before, one obtains

    \frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N \mid W)
        = \sum_n \Big[ \langle y_i y_j \rangle_{P(h|x^{(n)},W)} - \langle y_i y_j \rangle_{P(x,h|W)} \Big]

(the first term is the "clamped" phase, the second the "free" phase).

The term \langle y_i y_j \rangle_{P(h|x^{(n)},W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables sampling freely from their conditional distribution.

The term \langle y_i y_j \rangle_{P(x,h|W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.
Boltzmann Machines with Input-Hidden-Output
The Boltzmann machine considered so far is a powerful stochastic Hopfield network, but it has no ability to perform classification. Let us introduce visible input and output units:

    x \equiv (x_i, x_o)

Note that pattern x^{(n)} consists of an input and an output part, that is, x^{(n)} \equiv (x_i^{(n)}, x_o^{(n)}). The gradient then becomes

    \sum_n \Big[ \langle y_i y_j \rangle_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \langle y_i y_j \rangle_{\text{clamped to } x_i^{(n)}} \Big]
Boltzmann Machine Weight Updates
Combine gradient descent and simulated annealing to update the weights:

    \Delta w_{ij} = \frac{\eta}{T} \Big[ \langle y_i y_j \rangle_{\text{clamped to } (x_i^{(n)}, x_o^{(n)})} - \langle y_i y_j \rangle_{\text{clamped to } x_i^{(n)}} \Big]

High computational complexity:

- present each pattern several times
- anneal several times

Mean-field version of Boltzmann learning:

- calculate approximations of the correlations ([y_i y_j]) entering the gradient
Deterministic Boltzmann Learning
input : {x^{(n)}}_{n=1}^N ; η, T_start, T_stop ∈ R
output: W

begin
    T ← T_start
    repeat
        randomly select a pattern from the sample {x^{(n)}}_{n=1}^N
        randomize states
        anneal the network with input and output clamped
        at the final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at the final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + (η/T) ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < T_stop
    return W
end
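A Python sketch of one pass of the algorithm above in its mean-field form, where annealing is performed on deterministic activations y_i = tanh((1/T) Σ_j w_ij y_j) and the correlations are approximated by products of mean activations, [y_i y_j] ≈ y_i y_j; the unit ordering (input, output, hidden), the cooling schedule, and the learning rate eta are all assumptions for illustration.

```python
import numpy as np

def mean_field_anneal(W, y, clamped, T_start=10.0, T_stop=0.5, cooling=0.9, sweeps=20):
    """Anneal the mean-field activations y_i = tanh((1/T) sum_j w_ij y_j),
    keeping the units marked in the boolean mask `clamped` fixed."""
    y = np.array(y, dtype=float)
    T = T_start
    while T > T_stop:
        for _ in range(sweeps):
            for i in np.where(~clamped)[0]:
                y[i] = np.tanh(W[i] @ y / T)
        T *= cooling
    return y, T

def deterministic_boltzmann_step(W, x_in, x_out, n_hidden, eta=0.05, rng=None):
    """One weight update of deterministic (mean-field) Boltzmann learning.
    Unit order in the state vector: input, output, hidden."""
    rng = np.random.default_rng() if rng is None else rng
    n_in, n_out = len(x_in), len(x_out)
    n = n_in + n_out + n_hidden

    # phase 1: input and output clamped
    y = rng.uniform(-0.1, 0.1, size=n)                 # randomize states
    y[:n_in] = x_in
    y[n_in:n_in + n_out] = x_out
    clamped = np.zeros(n, dtype=bool)
    clamped[:n_in + n_out] = True
    y_c, T = mean_field_anneal(W, y, clamped)
    corr_clamped = np.outer(y_c, y_c)                  # [y_i y_j] with x_i, x_o clamped

    # phase 2: only the input clamped, output free
    y = rng.uniform(-0.1, 0.1, size=n)                 # randomize states again
    y[:n_in] = x_in
    clamped = np.zeros(n, dtype=bool)
    clamped[:n_in] = True
    y_f, T = mean_field_anneal(W, y, clamped)
    corr_free = np.outer(y_f, y_f)                     # [y_i y_j] with only x_i clamped

    # w_ij <- w_ij + (eta / T) ([y_i y_j]_{x_i,x_o clamped} - [y_i y_j]_{x_i clamped})
    W = W + (eta / T) * (corr_clamped - corr_free)
    np.fill_diagonal(W, 0.0)
    return W
```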