Advanced Machine Learning / Deep Belief Networks
Daniel Ulbricht

Source: 319655500.online.de/du/upload/DeepBeliefLearning.pdf


Page 1:

Advanced Machine Learning / Deep Belief Networks
Daniel Ulbricht

Page 2:

Agenda

● Short history of machine learning
● What is Deep Learning?
● Deriving learning in general
● Energy-Based Models / Restricted Boltzmann Machines
● Bringing things together

● You will implement a core learning algorithm

Page 3:

History: 1st wave

"The Perceptron"Frank Rosenblatt explained the Perceptron in the year 1958

● Simple problems like XOR could be solved

Page 4:

History: 2nd wave

"Backpropagation"

Developed by Paul Werbos in 1974
● Complex non-linear problems could now be solved

Page 5:

History: 3rd wave

"Deep Belief Networks"

The "magic" behind will be explained in this talk

Developed mainly 2006 by Geoff Hinton

Page 6:

History: Deep Learning

● An automatic way to learn representations (descriptors) from given data

● An attempt to learn multiple levels of representation of increasing complexity

Borrowed from Andrew Ng

Page 7:

History: Deep Learning

● Backpropagation is already an attempt to perform Deep Learning

● But there are some problems:
  ○ The gradient gets progressively diluted (decreasing update strength in the lower layers)
  ○ Initialization of the weights
  ○ How to label all the given data

Page 8:

Machine Learning in General

Input -> Weights W -> Output

Goal: find the weights W that maximize the probability of certain outputs given some input vectors.

Maximize: P(output | input, W)

Page 9:

Machine Learning in General

Learning can be performed using:
● Gradient ascent on: log P
● Gradient descent on: -log P

From optimization theory we know many downhill optimization algorithms:
● (Stochastic) Gradient Descent
● Conjugate Gradient
● Dogleg
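As a tiny illustration (my own toy example, not from the talk): gradient descent on -log P for a single Bernoulli parameter recovers the maximum-likelihood estimate.

```python
import numpy as np

# Toy example: fit a Bernoulli parameter p by gradient descent
# on the negative log-likelihood -log P(data | p).
data = np.array([1, 1, 1, 0, 1, 0, 1, 1])  # made-up binary observations
p = 0.5          # initial guess
lr = 0.1         # learning rate

for _ in range(500):
    # -log P = -sum(x*log p + (1-x)*log(1-p)); gradient w.r.t. p:
    grad = -(data.sum() / p - (len(data) - data.sum()) / (1 - p))
    p -= lr * grad / len(data)   # downhill step on -log P

print(round(p, 3))  # converges to the sample mean 6/8 = 0.75
```

Gradient descent on -log P and gradient ascent on log P are the same update with opposite sign conventions.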

Page 10:

Maximize the log-likelihood of the system

Log-likelihood of the data:

(1/N) sum_n log P(v_n) = (1/N) sum_n log sum_h exp(-E(v_n, h)) - log Z

Average log-likelihood per pattern minus the log of the normalization term.

Page 11:

Rules for Gradient Computation

Sum Rule: d/dw sum_i f_i(w) = sum_i d f_i/dw

Product Rule: d/dw [ f(w) g(w) ] = (df/dw) g + f (dg/dw)

1: d/dw log f(w) = (1/f(w)) df/dw

2: d/dw exp(f(w)) = exp(f(w)) df/dw

Page 12:

Gradient of first part

Gradient of first part:

d/dw log sum_h exp(-E(v,h))
  = (1 / sum_h exp(-E(v,h))) * d/dw sum_h exp(-E(v,h))            (Rule 1)
  = (1 / sum_h exp(-E(v,h))) * sum_h d/dw exp(-E(v,h))            (Sum Rule)
  = sum_h [ exp(-E(v,h)) / sum_h' exp(-E(v,h')) ] * (-dE(v,h)/dw) (Rule 2, reorder)
  = - sum_h P(h|v) dE(v,h)/dw

Sum over the posterior ;-)

Page 13:

Gradient of second part

Gradient of second part:

d/dw log Z = d/dw log sum_{v,h} exp(-E(v,h)) = - sum_{v,h} P(v,h) dE(v,h)/dw

Sum over the joint.

Page 14:

Full Gradient

Full Gradient:

d log P(v)/dw = - sum_h P(h|v) dE(v,h)/dw + sum_{v,h} P(v,h) dE(v,h)/dw

Two averages around the same term. Therefore we can write:

d log P(v)/dw = - < dE/dw >_{P(h|v)} + < dE/dw >_{P(v,h)}

First term: Hebbian / positive phase. Second term: anti-Hebbian / negative phase.

Page 15:

Gradient in Sigmoid Belief Nets

Apply this knowledge to normal sigmoid nets, as used in backpropagation.

The joint is automatically normalized: Z = 1.

This makes the second gradient term zero:

d log Z / dw = 0

Full gradient: only the posterior term remains, which gives the well-known delta rule:

delta w_ij ~ (t_j - y_j) x_i
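As a toy illustration of the delta rule (my own example, not from the slides): a single sigmoid unit learns the AND function with the update delta w = eps * (t - y) * x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sigmoid unit trained with the delta rule on made-up data (AND function).
rng = np.random.default_rng(0)
x = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([0., 0., 0., 1.])           # AND targets (linearly separable)
w = rng.normal(0, 0.1, 2)
b = 0.0
lr = 0.5

for _ in range(2000):
    y = sigmoid(x @ w + b)
    w += lr * x.T @ (t - y)              # delta rule: eps * (target - output) * input
    b += lr * (t - y).sum()

print(np.round(sigmoid(x @ w + b)))      # outputs round to the AND targets
```

Unlike the perceptron, the unit cannot learn XOR: the delta rule still only finds linear decision boundaries for a single unit.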

Page 16:

Energy Based Models (EBM)

Energy-based probabilistic models define a probability distribution as follows:

P(x) = exp(-E(x)) / Z,   with the partition function (normalization term) Z = sum_x exp(-E(x))

High probability -> low energy
Low probability -> high energy
-> Minimize the energy
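As a tiny numeric illustration (the energies are made up, not from the talk), the definition above turns any energy assignment into a normalized distribution:

```python
import numpy as np

# Energy-based model over 4 discrete states: P(x) = exp(-E(x)) / Z.
E = np.array([1.0, 2.0, 0.5, 3.0])   # lower energy -> higher probability
Z = np.exp(-E).sum()                  # partition function (normalization term)
P = np.exp(-E) / Z

print(P)  # the lowest-energy state (index 2) gets the highest probability
```

Note that Z requires a sum over every state, which is exactly what becomes intractable for large models.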

Page 17:

Energy Based Models with Hidden Units

In reality we can't observe the full state of our data and/or we are not aware of indirect influences. Therefore we add hidden units to increase the expressive power of the model.

Page 18:

Restricted Boltzmann Machine

Fancy name for a simple bipartite graph with bidirectional connections:
● No connections inside the same layer
● No loops
● The energy function is used to perform transitions
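The standard binary-RBM energy is E(v, h) = -a.v - b.h - v.W.h, with visible biases a, hidden biases b, and weight matrix W. A minimal sketch (my own numpy code, toy sizes and values):

```python
import numpy as np

# Energy of one joint configuration (v, h) of a binary RBM.
def rbm_energy(v, h, W, a, b):
    # E(v,h) = -a.v - b.h - v.W.h
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (3, 2))        # 3 visible, 2 hidden units (made up)
a = np.array([0.1, -0.2, 0.3])        # visible biases
b = np.array([0.05, -0.1])            # hidden biases
v = np.array([1., 0., 1.])
h = np.array([0., 1.])

print(rbm_energy(v, h, W, a, b))
```

Lower energy for a configuration means higher probability under P(v,h) = exp(-E(v,h)) / Z.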

Page 19:

Alternate Gibbs Sampling

Computing the averages over the posterior and the joint is very expensive.

To overcome this, Gibbs sampling is used.

Gibbs sampling inside energy-based models leads to the simple sigmoid function.

The proof is easy but would take too long here: use the normal Gibbs algorithm and plug the energy term into the distribution. You can find it on my webpage.

Page 20:

Alternate Gibbs Sampling

Alternate Gibbs Sampling:

● Sample up (visible to hidden)
● Sample down (hidden to visible)
● Continue...
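One up/down pass can be sketched in Python (my own numpy sketch with made-up shapes; the conditional probabilities are the standard binary-RBM sigmoids):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One alternating Gibbs step for a binary RBM:
# up:   P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij)
# down: P(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j)
rng = np.random.default_rng(42)
W = rng.normal(0, 0.1, (6, 4))    # 6 visible, 4 hidden units (made up)
a = np.zeros(6)                   # visible biases
b = np.zeros(4)                   # hidden biases
v = rng.integers(0, 2, 6).astype(float)

p_h = sigmoid(b + v @ W)          # sample up (visible to hidden)
h = (rng.random(4) < p_h).astype(float)

p_v = sigmoid(a + W @ h)          # sample down (hidden to visible)
v1 = (rng.random(6) < p_v).astype(float)

print(h, v1)
```

Repeating this up/down alternation is the "continue..." step of the chain.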

Page 21:

Alternate Gibbs Sampling

● Running it for infinitely many iterations would give the exact gradient (~ Markov chain Monte Carlo)

● Surprisingly, even a single iteration works very well in practice
  ○ Geoff Hinton tried this in 2006 and recognized that the system converges well even with a single iteration
  ○ Called Contrastive Divergence

Page 22:

Alternate Gibbs Sampling

Alternate Gibbs Sampling:

● Start with a training vector
● Sample up (visible to hidden)  <- Hebbian
● Sample down (hidden to visible)
● Sample up  <- Anti-Hebbian
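The four steps above can be sketched as a CD-1 update (a minimal sketch in numpy rather than the talk's Matlab/Octave; all names, sizes, and the training pattern are my own, not the files from the author's homepage):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, a, b, rng, lr=0.1):
    # One CD-1 step for a binary RBM; W: visible x hidden, a/b: visible/hidden biases.
    p_h0 = sigmoid(b + v0 @ W)                           # sample up on the data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # stochastic binary hiddens
    p_v1 = sigmoid(a + h0 @ W.T)                         # sample down (reconstruction)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b + v1 @ W)                           # second up step: probabilities, not samples
    # Hebbian (data) statistics minus anti-Hebbian (reconstruction) statistics
    dW = np.outer(v0, h0) - np.outer(v1, p_h1)
    return W + lr * dW, a + lr * (v0 - v1), b + lr * (h0 - p_h1)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (4, 3))
a, b = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 0.0])      # a single made-up training pattern

for _ in range(200):
    W, a, b = cd1_update(v0, W, a, b, rng)

# after training, a deterministic up-down pass should reconstruct v0 well
recon = sigmoid(a + sigmoid(b + v0 @ W) @ W.T)
print(np.round(recon))
```

With only one training pattern the RBM simply memorizes it; the same update applied over many patterns learns the data distribution.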

Page 23:

Bring things together

For simplification we use binary input and output units from now on:
● The terms get much easier to compute
● It is also the common choice in practical applications

Page 24:

Bring things together

● Hebbian Part (Up Step):

p(h_j = 1 | v) = sigmoid( b_j + sum_i v_i w_ij )

Sum over all visible units, each multiplied by the corresponding hidden-unit weight.

Sigmoid function: we can use it because Gibbs sampling inside an EBM is sigmoid.

Make the output stochastic (sample h_j from this probability); this simplifies the next steps.

b_j: bias of the hidden unit.

Page 25:

Bring things together

● Hebbian Part (Down Step):

p(v_i = 1 | h) = sigmoid( a_i + sum_j w_ij h_j )

Same as the up step, only using a different bias.

a_i: bias of the visible unit.

Page 26:

Bring things together

● Anti-Hebbian Part (Up Step):

p(h_j = 1 | v¹) = sigmoid( b_j + sum_i v_i¹ w_ij )

Instead of the reconstructed (sampled) output, the probability is now used.

Page 27:

Bring things together

● Full Gradient:

delta w_ij = eps ( <v_i h_j>_data - <v_i h_j>_reconstruction )

Don't forget the biases:

delta a_i = eps ( v_i⁰ - v_i¹ ),   delta b_j = eps ( h_j⁰ - h_j¹ )

First average: over the posterior. Second average: over the joint.

Page 28:

So Far We Have

● The knowledge to train a Restricted Boltzmann Machine

● No need for labels -> our "labels" are the equilibrium level of the energy function

Open question:
● How to perform Deep Learning without the factorial behaviour

Page 29:

Stacking RBMs

To perform Deep Learning we stack multiple RBMs but train them layer by layer

Input

Page 30:

Stacking RBMs

To perform Deep Learning we stack multiple RBMs but train them layer by layer

Input

Hidden

W1

Page 31:

Stacking RBMs

To perform Deep Learning we stack multiple RBMs but train them layer by layer

Input

Hidden

W1 <- fixed (we don't update it anymore)

Hidden

W2
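The layer-by-layer scheme can be sketched as follows (my own numpy illustration; the layer sizes, data, and the pretend pre-trained weights are all made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Greedy layer-wise stacking: train RBM 1 on the input, freeze W1,
# then feed its hidden activities to RBM 2 as that layer's "data".
def up(v, W, b):
    return sigmoid(b + v @ W)

rng = np.random.default_rng(0)
data = rng.integers(0, 2, (10, 8)).astype(float)   # 10 binary input vectors

# pretend W1 was already trained by contrastive divergence and is now fixed
W1, b1 = rng.normal(0, 0.1, (8, 5)), np.zeros(5)
h1 = up(data, W1, b1)            # layer-1 representation of every pattern

# RBM 2 treats h1 as its visible data; only W2 would be updated from here on
W2, b2 = rng.normal(0, 0.1, (5, 3)), np.zeros(3)
h2 = up(h1, W2, b2)

print(h1.shape, h2.shape)
```

Each additional RBM is trained the same way on the representation produced by the frozen layers below it.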

Page 32:

Now we have

● A network which learns
  ○ Without labels
    ■ The labels are the equilibrium level of the energy term
  ○ Every layer learns a significant amount
    ■ Because each layer is trained independently of the others

Page 33:

Get hands on:

Download the example Matlab/Octave files from my homepage.

You will notice that calling runRBM
● will do nothing so far
● because it is missing the implementation of "Contrastive Divergence"

Try to implement "Contrastive Divergence" yourself.
  ○ The solution can also be found on my homepage

Page 34:

Thank you for listening