slides_lecture6


    Simulated Annealing

input : (x_1, t_1), ..., (x_N, t_N) ∈ R^d × {−1, +1}; T_start, T_stop ∈ R
output: w

begin
    randomly initialize w
    T ← T_start
    repeat
        w' ← N(w)    // neighbor of w, e.g. obtained by adding Gaussian noise N(0, σ)
        if E(w') < E(w) then
            w ← w'
        else if exp(−(E(w') − E(w)) / T) > rand[0, 1) then
            w ← w'
        decrease(T)
    until T < T_stop
    return w
end
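A minimal sketch of this loop in Python (the quadratic energy used in the usage line, the noise scale sigma, and the geometric cooling factor alpha are illustrative assumptions, not part of the slide):

import math
import random

def simulated_annealing(E, w, T_start, T_stop, sigma=0.1, alpha=0.95):
    """Minimize E(w) by annealed random search; w is a list of floats."""
    T = T_start
    while T >= T_stop:
        # propose a neighbor by adding Gaussian noise to every coordinate
        w_new = [wi + random.gauss(0.0, sigma) for wi in w]
        dE = E(w_new) - E(w)
        # always accept improvements; accept worse moves with probability exp(-dE/T)
        if dE < 0 or math.exp(-dE / T) > random.random():
            w = w_new
        T *= alpha  # decrease the temperature
    return w

# usage: minimize a simple quadratic energy
w_opt = simulated_annealing(lambda w: sum(wi**2 for wi in w), [2.0, -3.0], 10.0, 1e-3)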


    Continuous Hopfield Network

Let us consider our previously defined Hopfield network (identical architecture and learning rule), however with the following activity rule:

S_i = \tanh\left( \frac{1}{T} \sum_j w_{ij} S_j \right)

Start with a large (temperature) value of T and decrease it by some magnitude whenever a unit is updated (deterministic simulated annealing).

This type of Hopfield network can approximate the probability distribution

P(x|W) = \frac{1}{Z(W)} \exp[-E(x)] = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)
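A small Python sketch of the annealed tanh update (the cooling factor, the start/stop temperatures, and the two-unit example weights are illustrative assumptions):

import numpy as np

def anneal_hopfield(W, s, T_start=5.0, T_stop=0.05, alpha=0.98):
    """Continuous Hopfield updates S_i = tanh((1/T) * sum_j W_ij S_j),
    lowering T after every unit update (deterministic simulated annealing)."""
    s = s.astype(float).copy()
    T = T_start
    while T >= T_stop:
        i = np.random.randint(len(s))     # pick a unit to update
        s[i] = np.tanh(W[i] @ s / T)      # annealed activity rule
        T *= alpha                        # decrease the temperature
    return s

# usage: two-unit network that favors the pattern (+1, -1)
W = np.array([[0.0, -1.0], [-1.0, 0.0]])
print(anneal_hopfield(W, np.array([0.1, 0.2])))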


    Continuous Hopfield Network

Z(W) = \sum_x \exp[-E(x)]   (sum over all possible states)

is the partition function and ensures that P(x|W) is a probability distribution.

Idea: construct a stochastic Hopfield network that implements the probability distribution P(x|W).

Learn a model that is capable of generating patterns from that unknown distribution.

Quantify (classify) seen and unseen patterns by means of probabilities.

If needed, we can generate more patterns (generative model).


    Boltzmann Machines

Given patterns \{x^{(n)}\}_{n=1}^N, we want to learn the weights such that the generative model

P(x|W) = \frac{1}{Z(W)} \exp\left( \frac{1}{2} x^T W x \right)

is well matched to those patterns. The states are updated according to the stochastic rule: set x_i = +1 with probability

\frac{1}{1 + \exp\left( -2 \sum_j w_{ij} x_j \right)},

else set x_i = -1.

Posterior probability of the weights given the data (Bayes' theorem):

P(W | \{x^{(n)}\}_{n=1}^N) = \frac{ \prod_{n=1}^N P(x^{(n)}|W) \, P(W) }{ P(\{x^{(n)}\}_{n=1}^N) }
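A minimal Python sketch of this stochastic update rule, i.e. repeated Gibbs sweeps over the units (the example weight matrix and the number of sweeps are illustrative assumptions):

import numpy as np

def gibbs_sweep(W, x):
    """One sweep of the stochastic rule: x_i = +1 with prob. 1/(1+exp(-2*sum_j W_ij x_j))."""
    for i in np.random.permutation(len(x)):
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * W[i] @ x))
        x[i] = 1.0 if np.random.rand() < p_plus else -1.0
    return x

# usage: sample states of a 3-unit machine
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.2],
              [-0.3, 0.2, 0.0]])
x = np.random.choice([-1.0, 1.0], size=3)
for _ in range(100):    # repeated sweeps approximate samples from P(x|W)
    x = gibbs_sweep(W, x)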


    Boltzmann Machines

Apply the maximum likelihood method to the first term in the numerator:

\ln \prod_{n=1}^N P(x^{(n)}|W) = \sum_{n=1}^N \left[ \frac{1}{2} x^{(n)T} W x^{(n)} - \ln Z(W) \right]

Taking the derivative of the log likelihood (note that W is symmetric, w_{ij} = w_{ji}) gives

\frac{\partial}{\partial w_{ij}} \, \frac{1}{2} x^{(n)T} W x^{(n)} = x_i^{(n)} x_j^{(n)}

and

\frac{\partial}{\partial w_{ij}} \ln Z(W) = \frac{1}{Z(W)} \sum_x \frac{\partial}{\partial w_{ij}} \exp\left( \frac{1}{2} x^T W x \right)
= \frac{1}{Z(W)} \sum_x \exp\left( \frac{1}{2} x^T W x \right) x_i x_j
= \sum_x x_i x_j \, P(x|W) = \langle x_i x_j \rangle_{P(x|W)}


    Boltzmann Machines (cont.)

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N | W) = \sum_{n=1}^N \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(x|W)} \right]
= N \left[ \langle x_i x_j \rangle_{Data} - \langle x_i x_j \rangle_{P(x|W)} \right]

Empirical correlation between x_i and x_j:

\langle x_i x_j \rangle_{Data} \equiv \frac{1}{N} \sum_{n=1}^N x_i^{(n)} x_j^{(n)}

Correlation between x_i and x_j under the current model:

\langle x_i x_j \rangle_{P(x|W)} \equiv \sum_x x_i x_j \, P(x|W)
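For very small networks both correlations, and hence the gradient, can be computed exactly by enumerating all states; the sketch below is an illustrative Python version (the example patterns and the learning rate are assumptions, not from the slides):

import itertools
import numpy as np

def log_likelihood_gradient(W, data):
    """Exact gradient N*(<x_i x_j>_Data - <x_i x_j>_P(x|W)) for +-1 states."""
    d = W.shape[0]
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    energies = 0.5 * np.einsum('si,ij,sj->s', states, W, states)   # (1/2) x^T W x per state
    p = np.exp(energies)
    p /= p.sum()                                                    # P(x|W) = exp(.)/Z(W)
    corr_model = np.einsum('s,si,sj->ij', p, states, states)        # <x_i x_j> under the model
    corr_data = data.T @ data / len(data)                           # <x_i x_j> under the data
    return len(data) * (corr_data - corr_model)

# usage: one gradient-ascent step on two patterns of a 3-unit machine
data = np.array([[1.0, 1.0, -1.0], [1.0, 1.0, 1.0]])
W = np.zeros((3, 3))
W += 0.01 * log_likelihood_gradient(W, data)
np.fill_diagonal(W, 0.0)    # keep zero self-connections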


Interpretation of Boltzmann Machine Learning

Illustrative description (MacKay's book, p. 523):

Awake state: measure the correlation between x_i and x_j in the real world, and increase the weights in proportion to the measured correlations.

Sleep state: dream about the world using the generative model P(x|W) and measure the correlation between x_i and x_j in the model world. Use these correlations to determine a proportional decrease in the weights.

If the correlations in the dream world and in the real world match, then the two terms balance and the weights do not change.


    Boltzmann Machines with Hidden Units

To model higher-order correlations, hidden units are required.

x: states of the visible units,

h: states of the hidden units,

the generic state of a unit (either visible or hidden) is denoted by y_i, with y ≡ (x, h),

the state of the network when the visible units are clamped in state x^{(n)} is y^{(n)} ≡ (x^{(n)}, h).

The probability of a single pattern x^{(n)} given W is

P(x^{(n)}|W) = \sum_h P(x^{(n)}, h|W) = \sum_h \frac{1}{Z(W)} \exp\left( \frac{1}{2} y^{(n)T} W y^{(n)} \right)

where

Z(W) = \sum_{x,h} \exp\left( \frac{1}{2} y^T W y \right)
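A small illustrative Python sketch of this marginalisation over the hidden states (the ordering of visible units before hidden units and the example weights are assumptions):

import itertools
import numpy as np

def pattern_probability(W, x, n_hidden):
    """P(x|W) = sum_h exp(0.5*y^T W y) / Z(W), with y = (x, h) and +-1 hidden units."""
    def boltzmann_weight(y):
        return np.exp(0.5 * y @ W @ y)
    hidden_states = list(itertools.product([-1.0, 1.0], repeat=n_hidden))
    visible_states = list(itertools.product([-1.0, 1.0], repeat=len(x)))
    numerator = sum(boltzmann_weight(np.concatenate([x, h])) for h in hidden_states)
    Z = sum(boltzmann_weight(np.concatenate([v, h]))
            for v in visible_states for h in hidden_states)
    return numerator / Z

# usage: 2 visible + 1 hidden unit, units ordered as (visible..., hidden...)
W = np.array([[0.0, 0.4, 0.8],
              [0.4, 0.0, 0.8],
              [0.8, 0.8, 0.0]])
print(pattern_probability(W, np.array([1.0, 1.0]), n_hidden=1))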


    Boltzmann Machines with Hidden Units (cont.)

Applying the maximum likelihood method as before, one obtains

\frac{\partial}{\partial w_{ij}} \ln P(\{x^{(n)}\}_{n=1}^N | W) = \sum_n \left[ \langle y_i y_j \rangle_{P(h|x^{(n)},W)} - \langle y_i y_j \rangle_{P(x,h|W)} \right]

(first term: visible units clamped to x^{(n)}; second term: free-running)

The term \langle y_i y_j \rangle_{P(h|x^{(n)},W)} is the correlation between y_i and y_j when the Boltzmann machine is simulated with the visible variables clamped to x^{(n)} and the hidden variables sampling freely from their conditional distribution.

The term \langle y_i y_j \rangle_{P(x,h|W)} is the correlation between y_i and y_j when the Boltzmann machine generates samples from its model distribution.


    Boltzmann Machines with Input-Hidden-Output

The Boltzmann machine considered so far is a powerful stochastic Hopfield network with no ability to perform classification. Let us introduce visible input and output units:

x ≡ (x_i, x_o)

Note that a pattern x^{(n)} consists of an input and an output part, that is, x^{(n)} ≡ (x_i^{(n)}, x_o^{(n)}).

\sum_n \left[ \langle y_i y_j \rangle_{clamped\ to\ (x_i^{(n)},\, x_o^{(n)})} - \langle y_i y_j \rangle_{clamped\ to\ x_i^{(n)}} \right]


Boltzmann Machine Weight Updates

Combine gradient descent and simulated annealing to update the weights:

\Delta w_{ij} = \frac{\eta}{T} \left[ \langle y_i y_j \rangle_{clamped\ to\ (x_i^{(n)},\, x_o^{(n)})} - \langle y_i y_j \rangle_{clamped\ to\ x_i^{(n)}} \right]

High computational complexity:

present each pattern several times

anneal several times

Mean-field version of Boltzmann learning:

calculate approximations of the correlations [y_i y_j] entering the gradient (see the sketch below)
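A minimal sketch of such a mean-field approximation (an assumption about the concrete scheme, not taken from the slides): each correlation [y_i y_j] is approximated by the product m_i m_j of mean activations obtained from the deterministic fixed point m_i = tanh((1/T) Σ_j w_ij m_j).

import numpy as np

def mean_field_correlations(W, m, T, clamped=None, n_iter=100):
    """Approximate [y_i y_j] ~= m_i m_j via the fixed point m_i = tanh((1/T) sum_j W_ij m_j).
    `clamped` maps unit index -> fixed value (e.g. a clamped visible unit)."""
    clamped = clamped or {}
    m = m.astype(float).copy()
    for idx, val in clamped.items():
        m[idx] = val
    for _ in range(n_iter):
        for i in range(len(m)):
            if i in clamped:
                continue
            m[i] = np.tanh(W[i] @ m / T)    # deterministic mean-field update
    return np.outer(m, m)                    # [y_i y_j] ~= m_i * m_j

# usage: correlations with unit 0 clamped to +1, at temperature T = 1
W = np.array([[0.0, 0.5], [0.5, 0.0]])
print(mean_field_correlations(W, np.zeros(2), T=1.0, clamped={0: 1.0}))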


    Deterministic Boltzmann Learning

input : {x^(n)}_{n=1..N}; η, T_start, T_stop ∈ R
output: W

begin
    T ← T_start
    repeat
        randomly select a pattern from the sample {x^(n)}_{n=1..N}
        randomize states
        anneal the network with input and output clamped
        at final, low T, calculate [y_i y_j]_{x_i, x_o clamped}
        randomize states
        anneal the network with input clamped but output free
        at final, low T, calculate [y_i y_j]_{x_i clamped}
        w_ij ← w_ij + (η/T) ( [y_i y_j]_{x_i, x_o clamped} − [y_i y_j]_{x_i clamped} )
    until T < T_stop
    return W
end
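A compact Python sketch of this training loop, assuming the `mean_field_correlations` helper sketched two slides earlier for the annealing/correlation step and a geometric schedule for lowering T (both are assumptions; the slide itself leaves the annealing details open):

import numpy as np

def deterministic_boltzmann_learning(patterns, n_hidden, eta, T_start, T_stop, alpha=0.9):
    """patterns: list of (inputs, outputs) with +-1 entries; returns the learned weights W."""
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    n = n_in + n_out + n_hidden
    W = 0.01 * np.random.randn(n, n); W = (W + W.T) / 2.0; np.fill_diagonal(W, 0.0)
    T = T_start
    while T >= T_stop:
        x_in, x_out = patterns[np.random.randint(len(patterns))]       # randomly select a pattern
        clamp_io = {i: v for i, v in enumerate(np.concatenate([x_in, x_out]))}
        clamp_i = {i: v for i, v in enumerate(x_in)}
        m0 = np.random.choice([-1.0, 1.0], size=n)                     # randomize states
        corr_io = mean_field_correlations(W, m0, T, clamped=clamp_io)  # input and output clamped
        corr_i = mean_field_correlations(W, m0, T, clamped=clamp_i)    # input clamped, output free
        W += (eta / T) * (corr_io - corr_i)                            # weight update
        np.fill_diagonal(W, 0.0)
        T *= alpha                                                     # assumed annealing schedule
    return W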